Zillow Scraping with Python: Extract Property Listings and Home Prices
Source code: zillow_properties_for_sale_scraper
Table of Contents
Introduction
Ethical Considerations
Scraping Workflow
Prerequisites
Project Setup
[PART 1] Scraping Zillow Data from the search page
Complete code for the first page
Get the information from the next page
Complete code for all pages
[PART 2] Scrape the other information from the property page
Complete code with the additional data
Complete code for the additional data with Proxy Rotation
Conclusion
Introduction
Zillow is a go-to platform for real estate data, featuring millions of property listings with detailed information on prices, locations, and home features. In this tutorial, we’ll guide you through the process of Zillow scraping using Python. You will learn how to extract essential property details such as home prices and geographic data, which will empower you to track market trends, analyze property values, and compare listings across various regions. This guide includes source code and techniques to effectively implement Zillow scraping.
Specifically, we will focus on collecting data for houses listed for sale in Nebraska. Our starting URL will be:
- Base URL: https://www.zillow.com/ne
The fields we want to scrape are:
- House URL
- Images
- Price
- Address
- Number of bedroom(s)
- Number of bathroom(s)
- House Size
- Lot Size
- House Type
- Year Built
- Description
- Listing Date
- Days on Zillow
- Total Views
- Total Saved
- Realtor Name
- Realtor Contact Number
- Agency
- Co-realtor Name
- Co-realtor contact number
- Co-realtor agency
Ethical Considerations
Before we dive into the technical aspects of scraping Zillow, it’s important to emphasize that this tutorial is intended for educational purposes only. When interacting with public servers, it’s vital to maintain a responsible approach. Here are some essential guidelines to keep in mind:
- Respect Website Performance: Avoid scraping at a speed that could negatively impact the website’s performance or availability.
- Public Data Only: Ensure that you only scrape data that is publicly accessible. Respect any restrictions set by the website.
- No Redistribution of Data: Refrain from redistributing entire public datasets, as this may violate legal regulations in certain jurisdictions.
Scraping Workflow
The Zillow scraper can be effectively divided into two parts, each focusing on different aspects of data extraction.
The first part involves extracting the essential information available on the Zillow search results page: HOUSE URL, PHOTO URLs, PRICE, FULL ADDRESS, STREET, CITY, STATE, ZIP CODE, NUMBER OF BEDROOMS, NUMBER OF BATHROOMS, HOUSE SIZE, LOT SIZE, and HOUSE TYPE.
It is important to note that while the search page provides a wealth of information, it does not display LOT SIZE and HOUSE TYPE directly. However, these values are accessible through the page's backend data, as I'll show you later.
The second part is to scrape the remaining information from each individual HOUSE URL page, which includes: YEAR BUILT, DESCRIPTION, LISTING DATE, DAYS ON ZILLOW, TOTAL VIEWS, TOTAL SAVED, REALTOR NAME, REALTOR CONTACT NO, AGENCY, CO-REALTOR NAME, CO-REALTOR CONTACT NO, and CO-REALTOR AGENCY.
Prerequisites
Before starting this project, ensure you have the following:
- Python Installed: Make sure Python is installed on your machine.
- Proxy Usage: It is highly recommended to use a proxy for this project to avoid detection and potential blocking. For this tutorial, we will use a residential proxy from Rayobyte. You can sign up for a free trial that offers 50MB of usage without requiring a credit card.
Project Setup
1. Create a new folder in your desired directory to house your project files.
2. Open your terminal in the directory you just created and run the following command to install the necessary libraries:
pip install requests beautifulsoup4
3. If you are using a proxy, I suggest installing the python-dotenv package as well so you can store your credentials in a .env file:
pip install python-dotenv
4. Open your preferred code editor (for example, Visual Studio Code) and create a new file with the extension .ipynb. This will create a new Jupyter notebook within VS Code.
[PART 1] Scraping Zillow Data from the search page
- House URL, Images, Price, Address, Number of bedroom(s), Number of bathroom(s), House Size, Lot Size and House Type
In this section, we will implement the code to scrape property data from Zillow. We will cover everything from importing libraries to saving the extracted information in a CSV file.
First, we need to import the libraries that will help us with HTTP requests and HTML parsing.
import requests
from bs4 import BeautifulSoup
import json
Setting headers helps disguise our request as if it’s coming from a real browser, which can help avoid detection.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}
If you have a proxy, include it in your requests to avoid potential blocks.
proxies = {
    'http': 'http://username:password@host:port',
    'https': 'http://username:password@host:port'
}
Make sure to replace username, password, host, and port with your actual proxy credentials.
Or you can create a .env file to store your proxy credentials and load your proxies like this:
import os
from dotenv import load_dotenv

load_dotenv()
proxy = os.getenv("PROXY")
PROXIES = {
    'http': f'http://{proxy}',
    'https': f'http://{proxy}'
}
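For reference, the .env file itself could look something like the sketch below; the value is a placeholder, so substitute your actual proxy credentials in the username:password@host:port format expected by the code above.

# .env (placeholder credentials, not real ones)
PROXY=username:password@host:port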
Define the URL for the state you want to scrape—in this case, Nebraska.
url = "https://www.zillow.com/ne"
Send a GET request to the server using the headers and proxies defined earlier.
response = requests.get(url, headers=headers, proxies=proxies)  # Use proxies if available
# If you don't have a proxy:
# response = requests.get(url, headers=headers)
Use BeautifulSoup to parse the HTML content of the page.
soup = BeautifulSoup(response.content, 'html.parser')
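Before extracting anything, it can be worth a quick sanity check that the request succeeded and that the page actually contains the data we rely on later. This small check is my own addition rather than part of the original flow:

# A non-200 status or a missing __NEXT_DATA__ script usually means the
# request was blocked or a CAPTCHA page was returned.
print('Status code:', response.status_code)
print('Has __NEXT_DATA__:', soup.find('script', id='__NEXT_DATA__') is not None)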
Extract House URLs from Listing Cards
The first thing that we want from the first landing page is to extract all the House URLs. Normally these URLs are available inside the “listing cards”.
Inspecting the “Listing Card”
To inspect an element, right-click anywhere on the page and click "Inspect", or simply press F12. Then click the element-picker arrow icon and hover over the element you want.
listing_card = soup.find_all('li', class_='ListItem-c11n-8-105-0__sc-13rwu5a-0')
print(len(listing_card))
As we can see, there are 42 listings on this first page.
Now, let's try getting the URL. If we expand the li tag, we will notice an a tag inside it, and the URL is in its href attribute:
To get this value, let's test by extracting it from the first listing only.
card = listing_card[0]
house_url = card.find('a').get('href')
print('URL:', house_url)
This works fine so far. However, as you may or may not know, Zillow has strong anti-bot detection mechanisms, so with this method you will only get 10 URLs instead of the 42 listings that appear on the first page.
Overcome Anti-Bot Detection by Extracting the Data in JSON Format
To overcome this issue, I found another approach that uses the JavaScript-rendered data returned with the web page. If we scroll down in the "Inspect" panel, we will find a script tag with the id "__NEXT_DATA__".
content = soup.find('script', id='__NEXT_DATA__')
Convert the content to json format.
import json
json_content = content.string
data = json.loads(json_content)
Save this JSON data for easier inspection later:
with open('output.json', 'w') as json_file:
    json.dump(data, json_file, indent=4)
After running this code, you'll get the output.json file inside your folder. Open the file to locate the URL; I use Ctrl+F to search for it inside VS Code.
Notice that the URL is stored under the "detailUrl" key. Apart from that, the JSON contains other useful information as well.
To extract the value inside this json file:
house_details = data['props']['pageProps']['searchPageState']['cat1']['searchResults']['listResults']
Get the first listing
detail = house_details[0]
house_url = detail['detailUrl']
house_url
We get the same value as before.
To get all the URLs from the first page:
house_urls = [detail['detailUrl'] for detail in house_details]
By using this method, we are able to get all the URLs.
As we inspect the JSON file, we can see the other information we're interested in as well, so let's extract those values too.
Image URLs
photo_urls = [photo['url'] for photo in detail['carouselPhotos']]
price = detail['price']
full_address = detail['address']
address_street = detail['addressStreet']
city = detail['addressCity']
state = detail['addressState']
zipcode = detail['addressZipcode']
home_info = detail['hdpData']['homeInfo']
bedrooms = home_info['bedrooms']
bathrooms = home_info['bathrooms']
house_size = home_info['livingArea']
lot_size = home_info['lotAreaValue']
house_type = home_info['homeType']
Save all the information in a CSV file
import csv

# Open a new CSV file for writing
with open('house_details.csv', 'w', newline='', encoding='utf-8') as csvfile:
    # Create a CSV writer object
    csvwriter = csv.writer(csvfile)

    # Write the header row
    csvwriter.writerow(['HOUSE URL', 'PHOTO URLs', 'PRICE', 'FULL ADDRESS', 'STREET', 'CITY', 'STATE',
                        'ZIP CODE', 'NUMBER OF BEDROOMS', 'NUMBER OF BATHROOMS', 'HOUSE SIZE', 'LOT SIZE', 'HOUSE TYPE'])

    # Iterate through the house details and write each row
    for detail in house_details:
        house_url = detail['detailUrl']
        photo_urls = ','.join([photo['url'] for photo in detail['carouselPhotos']])
        price = detail['price']
        full_address = detail['address']
        address_street = detail['addressStreet']
        city = detail['addressCity']
        state = detail['addressState']
        zipcode = detail['addressZipcode']
        home_info = detail['hdpData']['homeInfo']
        bedrooms = home_info['bedrooms']
        bathrooms = home_info['bathrooms']
        house_size = home_info['livingArea']
        lot_size = home_info['lotAreaValue']
        lot_unit = home_info['lotAreaUnit']
        house_type = home_info['homeType']

        # Write the row to the CSV file
        csvwriter.writerow([house_url, photo_urls, price, full_address, address_street, city, state, zipcode,
                            bedrooms, bathrooms, house_size, f'{lot_size} {lot_unit}', house_type])

print("Data has been saved to house_details.csv")
This is all the output from the first page, 41 rows in total.
Complete code for the first page
import os
import requests
from bs4 import BeautifulSoup
import json
import csv
from dotenv import load_dotenv

load_dotenv()

# Define headers for the HTTP request
HEADERS = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

# Define proxy settings (if needed)
proxy = os.getenv("PROXY")
PROXIES = {
    'http': f'http://{proxy}',
    'https': f'http://{proxy}'
}

def fetch_data(url):
    try:
        response = requests.get(url, headers=HEADERS, proxies=PROXIES)
        response.raise_for_status()
        return response.content
    except requests.RequestException as e:
        print(f"Error fetching data: {e}")
        return None

def parse_data(content):
    soup = BeautifulSoup(content, 'html.parser')
    script_content = soup.find('script', id='__NEXT_DATA__')
    if script_content:
        json_content = script_content.string
        return json.loads(json_content)
    else:
        print("Could not find the required script tag.")
        return None

def save_to_csv(house_details, output_file):
    with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
        csvwriter = csv.writer(csvfile)
        csvwriter.writerow([
            'HOUSE URL', 'PHOTO URLs', 'PRICE', 'FULL ADDRESS', 'STREET', 'CITY', 'STATE', 'ZIP CODE',
            'NUMBER OF BEDROOMS', 'NUMBER OF BATHROOMS', 'HOUSE SIZE', 'LOT SIZE', 'HOUSE TYPE'
        ])
        for detail in house_details:
            home_info = detail['hdpData']['homeInfo']
            photo_urls = ','.join([photo['url'] for photo in detail['carouselPhotos']])
            # Concatenate lot area value and unit
            lot_size = f"{home_info.get('lotAreaValue')} {home_info.get('lotAreaUnit')}"
            csvwriter.writerow([
                detail['detailUrl'],
                photo_urls,
                detail['price'],
                detail['address'],
                detail['addressStreet'],
                detail['addressCity'],
                detail['addressState'],
                detail['addressZipcode'],
                home_info.get('bedrooms'),
                home_info.get('bathrooms'),
                home_info.get('livingArea'),
                lot_size,
                home_info.get('homeType').replace('_', ' ')
            ])

def main():
    URL = "https://www.zillow.com/ne"
    content = fetch_data(URL)

    output_directory = 'OUTPUT_1'
    os.makedirs(output_directory, exist_ok=True)
    file_name = 'house_details_first_page.csv'
    output_file = os.path.join(output_directory, file_name)

    if content:
        data = parse_data(content)
        if data:
            house_details = data['props']['pageProps']['searchPageState']['cat1']['searchResults']['listResults']
            save_to_csv(house_details, output_file)
            print(f"Data has been saved to {output_file}")

if __name__ == "__main__":
    main()
After running this code, it will create a new folder named OUTPUT_1, and you'll find the file house_details_first_page.csv inside it.
Get the information from the next page
First, take a look at the URLs for the pages we want to scrape:
- First Page: https://www.zillow.com/ne
- Second Page: https://www.zillow.com/ne/2_p
- Third Page: https://www.zillow.com/ne/3_p
Notice how the page number increments by 1 with each subsequent page.
To automate the scraping process, we will utilize a while loop that iterates through all the pages. Here’s how we can set it up:
base_url = "https://www.zillow.com/ne" page = 1 max_pages = 10 # Adjust this to scrape more pages, or set to None for all pages while max_pages is None or page <= max_pages: if page == 1: url = base_url else: url = f"{base_url}/{page}_p"
Complete code for all pages
Below is the complete code that scrapes all specified pages. We will also use tqdm to monitor our scraping progress. To install tqdm, run:
pip install tqdm
Additionally, we'll implement logging to capture any errors during execution. A log file named scraper.log will be created to store these logs.
Important Notes
- The current setup limits scraping to 5 pages. To extract data from all available pages, simply change max_pages (defined in main()) to None.
- Don't forget to update your proxy credentials as necessary.
import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import time
import logging
from tqdm import tqdm
from dotenv import load_dotenv

load_dotenv()

# Set up logging
logging.basicConfig(filename='scraper.log', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

# Define headers for the HTTP request
HEADERS = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

# Define proxy settings (if needed)
proxy = os.getenv("PROXY")
PROXIES = {
    'http': f'http://{proxy}',
    'https': f'http://{proxy}'
}

def fetch_data(url):
    try:
        response = requests.get(url, headers=HEADERS, proxies=PROXIES)
        response.raise_for_status()
        return response.content
    except requests.RequestException as e:
        logging.error(f"Error fetching data: {e}")
        return None

def parse_data(content):
    try:
        soup = BeautifulSoup(content, 'html.parser')
        script_content = soup.find('script', id='__NEXT_DATA__')
        if script_content:
            json_content = script_content.string
            return json.loads(json_content)
        else:
            logging.error("Could not find the required script tag.")
            return None
    except json.JSONDecodeError as e:
        logging.error(f"Error parsing JSON: {e}")
        return None

def save_to_csv(house_details, output_file, mode='a'):
    with open(output_file, mode, newline='', encoding='utf-8') as csvfile:
        csvwriter = csv.writer(csvfile)
        if mode == 'w':
            csvwriter.writerow([
                'HOUSE URL', 'PHOTO URLs', 'PRICE', 'FULL ADDRESS', 'STREET', 'CITY', 'STATE', 'ZIP CODE',
                'NUMBER OF BEDROOMS', 'NUMBER OF BATHROOMS', 'HOUSE SIZE', 'LOT SIZE', 'HOUSE TYPE'
            ])
        for detail in tqdm(house_details, desc="Saving house details", unit="house"):
            try:
                home_info = detail.get('hdpData', {}).get('homeInfo', {})
                photo_urls = ','.join([photo.get('url', '') for photo in detail.get('carouselPhotos', [])])
                # Concatenate lot area value and unit
                lot_size = f"{home_info.get('lotAreaValue', '')} {home_info.get('lotAreaUnit', '')}"
                csvwriter.writerow([
                    detail.get('detailUrl', ''),
                    photo_urls,
                    detail.get('price', ''),
                    detail.get('address', ''),
                    detail.get('addressStreet', ''),
                    detail.get('addressCity', ''),
                    detail.get('addressState', ''),
                    detail.get('addressZipcode', ''),
                    home_info.get('bedrooms', ''),
                    home_info.get('bathrooms', ''),
                    home_info.get('livingArea', ''),
                    lot_size,
                    home_info.get('homeType', '').replace('_', ' ')
                ])
            except Exception as e:
                logging.error(f"Error processing house detail: {e}")
                logging.error(f"Problematic detail: {detail}")

def main():
    base_url = "https://www.zillow.com/ne"
    page = 1
    max_pages = 5  # Set this to the number of pages you want to scrape, or None for all pages

    output_directory = 'OUTPUT_1'
    os.makedirs(output_directory, exist_ok=True)
    file_name = f'house_details-1-{max_pages}.csv'
    output_file = os.path.join(output_directory, file_name)

    with tqdm(total=max_pages, desc="Scraping pages", unit="page") as pbar:
        while max_pages is None or page <= max_pages:
            if page == 1:
                url = base_url
            else:
                url = f"{base_url}/{page}_p"

            logging.info(f"Scraping page {page}: {url}")
            content = fetch_data(url)
            if content:
                data = parse_data(content)
                if data:
                    try:
                        house_details = data['props']['pageProps']['searchPageState']['cat1']['searchResults']['listResults']
                        if house_details:
                            save_to_csv(house_details, output_file, mode='a' if page > 1 else 'w')
                            logging.info(f"Data from page {page} has been saved to {output_file}")
                        else:
                            logging.info(f"No more results found on page {page}. Stopping.")
                            break
                    except KeyError as e:
                        logging.error(f"KeyError on page {page}: {e}")
                        logging.error(f"Data structure: {data}")
                        break
                else:
                    logging.error(f"Failed to parse data from page {page}. Stopping.")
                    break
            else:
                logging.error(f"Failed to fetch data from page {page}. Stopping.")
                break

            page += 1
            pbar.update(1)

            # Add a delay between requests to be respectful to the server
            time.sleep(5)

    logging.info("Scraping completed.")

if __name__ == "__main__":
    main()
[PART 2] Scrape the other information from the property page
- Year Built, Description, Listing Date, Days on Zillow, Total Views, Total Saved, Realtor Name, Realtor Contact Number, Agency, Co-realtor Name, Co-realtor contact number, Co-realtor agency
To extract additional information from a Zillow property listing that is not available directly on the search results page, we need to send a GET request to the specific HOUSE URL. This will allow us to gather details such as the year built, description, listing updated date, realtor information, number of views, and number of saves.
First, we will define the HOUSE URL from which we want to extract the additional information. This URL will vary depending on the specific property you are scraping.
house_url = 'https://www.zillow.com/homedetails/7017-S-132nd-Ave-Omaha-NE-68138/58586050_zpid/'
response = requests.get(house_url, headers=HEADERS, proxies=PROXIES)
soup = BeautifulSoup(response.content, 'html.parser')
Since we already have the image URLs, we will focus on the container below, which holds the rest of the relevant data for extraction.
content = soup.find('div', class_='ds-data-view-list')
Now let’s extract the Year Built:
Since there are a few other elements with the same span tag and class name, we’re going to be more specific by finding the element with the text “Built in”
year = content.find('span', class_='Text-c11n-8-100-2__sc-aiai24-0', string=lambda text: "Built in" in text)
year_built = year.text.strip().replace('Built in ', '')
year_built
The property description can be found within a specific div tag identified by its data-testid attribute.
description = content.find('div', attrs={'data-testid': 'description'}).text.strip()
description
Notice that at the end of the output there is a 'Show more' string, so let's remove it by replacing it with an empty string.
description = content.find('div', attrs={'data-testid': 'description'}).text.strip().replace('Show more','')
Get the listing date:
Similar to extracting the year built, we will find the listing updated date using a specific class name and filtering for relevant text.
listing_details = content.find_all('p', class_='Text-c11n-8-100-2__sc-aiai24-0', string=lambda text: text and "Listing updated" in text)
date_details = listing_details[0].text.strip()
date_part = date_details.split(' at ')[0]
listing_date = date_part.replace('Listing updated: ', '').strip()
Get the days on Zillow, total views and total saved
These values can be found within dt tags. We will extract them based on their positions.
containers = content.find_all('dt')
days_on_zillow = containers[0].text.strip()
views = containers[2].text.strip()
total_save = containers[4].text.strip()
Finally, we will extract information about the realtor and their agency from specific p tags. If we expand the p tag, we can see the values that we want inside it.
realtor_content = content.find('p', attrs={'data-testid': 'attribution-LISTING_AGENT'}).text.strip().replace(',', '')
print('REALTOR:', realtor_content)
As we see from the output above, the realtor's name and contact number are inside the same element, so let's separate them to keep our data clean.
name, contact = realtor_content.split('M:')
realtor_name = name.strip()
realtor_contact = contact.strip()
print('REALTOR NAME:', realtor_name)
print('REALTOR CONTACT NO:', realtor_contact)
agency_name = content.find('p', attrs={'data-testid': 'attribution-BROKER'}).text.strip().replace(',', '')
print('OFFICE:', agency_name)
co_realtor_content = content.find('p', attrs={'data-testid': 'attribution-CO_LISTING_AGENT'}).text.strip().replace(',', '')
print('CO-REALTOR CONTENT:', co_realtor_content)
Same as before, we need to split the name and the contact number.
name_contact = co_realtor_content.rsplit(' ', 1)
name = name_contact[0]
contact = name_contact[1]
co_realtor_name = name.strip()
co_realtor_contact = contact.strip()
print(f"CO-REALTOR NAME: {co_realtor_name}")
print(f"CO-REALTOR CONTACT NO: {co_realtor_contact}")
co_realtor_agency_name = content.find('p', attrs={'data-testid': 'attribution-CO_LISTING_AGENT_OFFICE'}).text.strip()
print('CO-REALTOR AGENCY NAME:', co_realtor_agency_name)
Complete code with the additional data
Let's enhance our data collection process by creating a new Python file dedicated to fetching the additional information. This script will first read the HOUSE URLs from the existing CSV file and send a request for each URL to extract the extra data. Once all information is gathered, it will save the results in a new CSV file, preserving the original data for reference.
import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import time
import logging
from tqdm import tqdm
from dotenv import load_dotenv

load_dotenv()

# Set up logging
logging.basicConfig(filename='scraper.log', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

# Define headers for the HTTP request
HEADERS = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

# Define proxy settings (if needed)
proxy = os.getenv("PROXY")
PROXIES = {
    'http': f'http://{proxy}',
    'https': f'http://{proxy}'
}

def fetch_data(url):
    try:
        response = requests.get(url, headers=HEADERS, proxies=PROXIES)
        response.raise_for_status()
        return response.content
    except requests.RequestException as e:
        logging.error(f"Error fetching data: {e}")
        return None

def parse_data(content):
    try:
        soup = BeautifulSoup(content, 'html.parser')
        script_content = soup.find('script', id='__NEXT_DATA__')
        if script_content:
            json_content = script_content.string
            return json.loads(json_content)
        else:
            logging.error("Could not find the required script tag.")
            return None
    except json.JSONDecodeError as e:
        logging.error(f"Error parsing JSON: {e}")
        return None

def save_to_csv(house_details, mode='a'):
    output_directory = 'OUTPUT_1'
    os.makedirs(output_directory, exist_ok=True)
    file_name = 'house_details-1-5.csv'  # Change accordingly
    output_file = os.path.join(output_directory, file_name)

    with open(output_file, mode, newline='', encoding='utf-8') as csvfile:
        csvwriter = csv.writer(csvfile)
        if mode == 'w':
            csvwriter.writerow([
                'HOUSE URL', 'PHOTO URLs', 'PRICE', 'FULL ADDRESS', 'STREET', 'CITY', 'STATE', 'ZIP CODE',
                'NUMBER OF BEDROOMS', 'NUMBER OF BATHROOMS', 'HOUSE SIZE', 'LOT SIZE', 'HOUSE TYPE'
            ])
        for detail in tqdm(house_details, desc="Saving house details", unit="house"):
            try:
                home_info = detail.get('hdpData', {}).get('homeInfo', {})
                photo_urls = ','.join([photo.get('url', '') for photo in detail.get('carouselPhotos', [])])
                # Concatenate lot area value and unit
                lot_size = f"{home_info.get('lotAreaValue', '')} {home_info.get('lotAreaUnit', '')}"
                csvwriter.writerow([
                    detail.get('detailUrl', ''),
                    photo_urls,
                    detail.get('price', ''),
                    detail.get('address', ''),
                    detail.get('addressStreet', ''),
                    detail.get('addressCity', ''),
                    detail.get('addressState', ''),
                    detail.get('addressZipcode', ''),
                    home_info.get('bedrooms', ''),
                    home_info.get('bathrooms', ''),
                    home_info.get('livingArea', ''),
                    lot_size,
                    home_info.get('homeType', '').replace('_', ' ')
                ])
            except Exception as e:
                logging.error(f"Error processing house detail: {e}")
                logging.error(f"Problematic detail: {detail}")

def main():
    base_url = "https://www.zillow.com/ne"
    page = 1
    max_pages = 5  # Set this to the number of pages you want to scrape, or None for all pages

    with tqdm(total=max_pages, desc="Scraping pages", unit="page") as pbar:
        while max_pages is None or page <= max_pages:
            if page == 1:
                url = base_url
            else:
                url = f"{base_url}/{page}_p"

            logging.info(f"Scraping page {page}: {url}")
            content = fetch_data(url)
            if content:
                data = parse_data(content)
                if data:
                    try:
                        house_details = data['props']['pageProps']['searchPageState']['cat1']['searchResults']['listResults']
                        if house_details:
                            save_to_csv(house_details, mode='a' if page > 1 else 'w')
                            logging.info(f"Data from page {page} has been saved to house_details-1-5.csv")
                        else:
                            logging.info(f"No more results found on page {page}. Stopping.")
                            break
                    except KeyError as e:
                        logging.error(f"KeyError on page {page}: {e}")
                        logging.error(f"Data structure: {data}")
                        break
                else:
                    logging.error(f"Failed to parse data from page {page}. Stopping.")
                    break
            else:
                logging.error(f"Failed to fetch data from page {page}. Stopping.")
                break

            page += 1
            pbar.update(1)

            # Add a delay between requests to be respectful to the server
            time.sleep(5)

    logging.info("Scraping completed.")

if __name__ == "__main__":
    main()
Why Create a New File?
The decision to generate a new file instead of overwriting the previous one serves as a safeguard. This approach ensures that we have a backup in case our code encounters issues or if access is blocked, allowing us to maintain data integrity throughout the process.
By implementing this strategy, we not only enhance our data collection capabilities but also ensure that we can troubleshoot effectively without losing any valuable information.
Complete code for the additional data with Proxy Rotation
Implementing proxy rotation is essential for avoiding anti-bot detection, especially when making numerous requests to a website. In this tutorial, we will demonstrate how to gather additional data from Zillow property listings while utilizing proxies from Rayobyte, which offers 50MB of residential proxy traffic for free upon signup.
Download and Prepare the Proxy List
Sign Up for Rayobyte: Create an account on Rayobyte to access their proxy services.
Generate Proxy List:
- Navigate to the “Proxy List Generator” in your dashboard.
- Set the format to username:password@hostname:port (an example file is sketched after these steps).
- Download the proxy list.
Move the Proxy File: Locate the downloaded file in your downloads directory and move it to your code directory.
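The downloaded file, saved here as proxy-list.txt (the name assumed by the code below), is simply one proxy per line in that format; the entries shown are placeholders, not real credentials:

user123:secret@res.example-proxy.net:8000
user123:secret@res.example-proxy.net:8001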
Implement Proxy Rotation in Your Code
import os
import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
import pandas as pd
import random
import time
import logging
from tqdm import tqdm

load_dotenv()

# Set up logging
logging.basicConfig(filename='scraper.log', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S')

HEADERS = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

def load_proxies(file_path):
    with open(file_path, 'r') as f:
        return [line.strip() for line in f if line.strip()]

PROXY_LIST = load_proxies('proxy-list.txt')

def get_random_proxy():
    return random.choice(PROXY_LIST)

def get_proxies(proxy):
    return {
        'http': f'http://{proxy}',
        'https': f'http://{proxy}'
    }

def scrape_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        proxy = get_random_proxy()
        proxies = get_proxies(proxy)
        try:
            response = requests.get(url, headers=HEADERS, proxies=proxies, timeout=30)
            if response.status_code == 200:
                return response
            else:
                logging.warning(f"Attempt {attempt + 1} failed with status code {response.status_code} for URL: {url}")
        except requests.RequestException as e:
            logging.error(f"Attempt {attempt + 1} failed with error: {e} for URL: {url}")
        time.sleep(random.uniform(1, 3))
    logging.error(f"Failed to fetch data for {url} after {max_retries} attempts.")
    return None

def scrape_house_data(house_url):
    response = scrape_with_retry(house_url)
    if not response:
        return None

    soup = BeautifulSoup(response.content, 'html.parser')
    content = soup.find('div', class_='ds-data-view-list')
    if not content:
        logging.error(f"Failed to find content for {house_url}")
        return None

    year = content.find('span', class_='Text-c11n-8-100-2__sc-aiai24-0',
                        string=lambda text: "Built in" in text)
    year_built = year.text.strip().replace('Built in ', '') if year else "N/A"

    description_elem = content.find('div', attrs={'data-testid': 'description'})
    description = description_elem.text.strip().replace('Show more', '') if description_elem else "N/A"

    listing_details = content.find_all('p', class_='Text-c11n-8-100-2__sc-aiai24-0',
                                       string=lambda text: text and "Listing updated" in text)
    listing_date = "N/A"
    if listing_details:
        date_details = listing_details[0].text.strip()
        date_part = date_details.split(' at ')[0]
        listing_date = date_part.replace('Listing updated: ', '').strip()

    containers = content.find_all('dt')
    days_on_zillow = containers[0].text.strip() if len(containers) > 0 else "N/A"
    views = containers[2].text.strip() if len(containers) > 2 else "N/A"
    total_save = containers[4].text.strip() if len(containers) > 4 else "N/A"

    realtor_elem = content.find('p', attrs={'data-testid': 'attribution-LISTING_AGENT'})
    if realtor_elem:
        realtor_content = realtor_elem.text.strip().replace(',', '')
        if 'M:' in realtor_content:
            name, contact = realtor_content.split('M:')
        else:
            name_contact = realtor_content.rsplit(' ', 1)
            name = name_contact[0]
            contact = name_contact[1]
        realtor_name = name.strip()
        realtor_contact = contact.strip()
    else:
        realtor_name = "N/A"
        realtor_contact = "N/A"

    agency_elem = content.find('p', attrs={'data-testid': 'attribution-BROKER'})
    agency_name = agency_elem.text.strip().replace(',', '') if agency_elem else "N/A"

    co_realtor_elem = content.find('p', attrs={'data-testid': 'attribution-CO_LISTING_AGENT'})
    if co_realtor_elem:
        co_realtor_content = co_realtor_elem.text.strip().replace(',', '')
        if 'M:' in co_realtor_content:
            name, contact = co_realtor_content.split('M:')
        else:
            name_contact = co_realtor_content.rsplit(' ', 1)
            name = name_contact[0]
            contact = name_contact[1]
        co_realtor_name = name.strip()
        co_realtor_contact = contact.strip()
    else:
        co_realtor_name = "N/A"
        co_realtor_contact = "N/A"

    co_realtor_agency_elem = content.find('p', attrs={'data-testid': 'attribution-CO_LISTING_AGENT_OFFICE'})
    co_realtor_agency_name = co_realtor_agency_elem.text.strip() if co_realtor_agency_elem else "N/A"

    return {
        'YEAR BUILT': year_built,
        'DESCRIPTION': description,
        'LISTING DATE': listing_date,
        'DAYS ON ZILLOW': days_on_zillow,
        'TOTAL VIEWS': views,
        'TOTAL SAVED': total_save,
        'REALTOR NAME': realtor_name,
        'REALTOR CONTACT NO': realtor_contact,
        'AGENCY': agency_name,
        'CO-REALTOR NAME': co_realtor_name,
        'CO-REALTOR CONTACT NO': co_realtor_contact,
        'CO-REALTOR AGENCY': co_realtor_agency_name
    }

def ensure_output_directory(directory):
    if not os.path.exists(directory):
        os.makedirs(directory)
        logging.info(f"Created output directory: {directory}")

def load_progress(output_file):
    if os.path.exists(output_file):
        return pd.read_csv(output_file)
    return pd.DataFrame()

def save_progress(df, output_file):
    df.to_csv(output_file, index=False)
    logging.info(f"Progress saved to {output_file}")

def main():
    input_file = './OUTPUT_1/house_details.csv'
    output_directory = 'OUTPUT_2'
    file_name = 'house_details_scraped.csv'
    output_file = os.path.join(output_directory, file_name)

    ensure_output_directory(output_directory)

    df = pd.read_csv(input_file)

    # Load existing progress
    result_df = load_progress(output_file)

    # Determine which URLs have already been scraped
    scraped_urls = set(result_df['HOUSE URL']) if 'HOUSE URL' in result_df.columns else set()

    # Scrape data for each house URL
    for _, row in tqdm(df.iterrows(), total=df.shape[0], desc="Scraping Progress"):
        house_url = row['HOUSE URL']

        # Skip if already scraped
        if house_url in scraped_urls:
            continue

        logging.info(f"Scraping data for {house_url}")
        data = scrape_house_data(house_url)

        if data:
            # Combine the original row data with the scraped data
            combined_data = {**row.to_dict(), **data}
            new_row = pd.DataFrame([combined_data])
            # Append the new row to the result DataFrame
            result_df = pd.concat([result_df, new_row], ignore_index=True)
            # Save progress after each successful scrape
            save_progress(result_df, output_file)

        # Add a random delay between requests (1 to 5 seconds)
        time.sleep(random.uniform(1, 5))

    logging.info(f"Scraping completed. Final results saved to {output_file}")
    print(f"Scraping completed. Check {output_file} for results and scraper.log for detailed logs.")

if __name__ == "__main__":
    main()
Conclusion
In conclusion, this comprehensive guide on Zillow scraping with Python has equipped you with essential tools and techniques to effectively extract property listings and home prices. By following the outlined steps, you have learned how to navigate the complexities of web scraping, including overcoming anti-bot measures and utilizing proxies for seamless data retrieval.
Key takeaways from this tutorial include:
- Understanding the Ethical Considerations: Emphasizing responsible scraping practices to respect website performance and legal guidelines.
- Scraping Workflow: Dividing the scraping process into manageable parts for clarity and efficiency.
- Technical Implementation: Utilizing Python libraries such as requests, BeautifulSoup, and json for data extraction.
- Data Storage: Saving extracted information in CSV format for easy access and analysis.
As you implement these strategies, you will gain valuable insights into real estate trends and market dynamics, empowering you to make informed decisions based on the data collected. With the provided source code and detailed explanations, you are now well-prepared to adapt this project to your specific needs, whether that involves expanding your data collection or refining your analysis techniques. Embrace the power of data-driven insights as you explore the vast landscape of real estate information available through platforms like Zillow. Drop a comment below if you have any questions, and happy scraping!
Source code: zillow_properties_for_sale_scraper
Video: Extract data from Zillow properties for sale listing using Python