Zillow Scraping with Python: Extract Property Listings and Home Prices


Source code: zillow_properties_for_sale_scraper 

Table of Contents

Introduction
Ethical Considerations
Scraping Workflow
Prerequisites
Project Setup
[PART 1] Scraping Zillow Data from the search page
Complete code for the first page
Get the information from the next page
Complete code for all pages
[PART 2] Scrape the other information from the Properties page
Complete code with the additional data
Complete code for the additional data with Proxy Rotation
Conclusion

Introduction

Zillow is a go-to platform for real estate data, featuring millions of property listings with detailed information on prices, locations, and home features. In this tutorial, we’ll guide you through the process of Zillow scraping using Python.  You will learn how to extract essential property details such as home prices and geographic data, which will empower you to track market trends, analyze property values, and compare listings across various regions. This guide includes source code and techniques to effectively implement Zillow scraping.

In this tutorial, we will focus on collecting data for houses listed for sale in Nebraska. Our starting URL will be https://www.zillow.com/ne.

The information we want to scrape is:

  • House URL
  • Images
  • Price
  • Address
  • Number of bedroom(s)
  • Number of bathroom(s)
  • House Size
  • Lot Size
  • House Type
  • Year Built
  • Description
  • Listing Date
  • Days on Zillow
  • Total Views
  • Total Saved
  • Realtor Name
  • Realtor Contact Number
  • Agency
  • Co-realtor Name
  • Co-realtor Contact Number
  • Co-realtor Agency

Ethical Considerations

Before we dive into the technical aspects of scraping Zillow, it’s important to emphasize that this tutorial is intended for educational purposes only. When interacting with public servers, it’s vital to maintain a responsible approach. Here are some essential guidelines to keep in mind:

  • Respect Website Performance: Avoid scraping at a speed that could negatively impact the website’s performance or availability.
  • Public Data Only: Ensure that you only scrape data that is publicly accessible. Respect any restrictions set by the website.
  • No Redistribution of Data: Refrain from redistributing entire public datasets, as this may violate legal regulations in certain jurisdictions.

Scraping Workflow

The Zillow scraper can be effectively divided into two parts, each focusing on different aspects of data extraction.

The first part involves extracting essential information from the Zillow search results page, which includes the following fields:

HOUSE URLs, PHOTO URLs, PRICE, FULL ADDRESS, STREET, CITY, STATE, ZIP CODE, NUMBER OF BEDROOMS, NUMBER OF BATHROOMS, HOUSE SIZE, LOT SIZE and HOUSE TYPE

search page

It is important to note that while the search page provides a wealth of information, it does not display LOT SIZE and HOUSE TYPE directly. However, these values are accessible through the backend which I’ll show you later.

The second part is to scrape the rest of the information from each individual HOUSE URL page, which includes:

YEAR BUILT, DESCRIPTION, LISTING DATE, DAYS ON ZILLOW, TOTAL VIEWS, TOTAL SAVED, REALTOR NAME, REALTOR CONTACT NO, AGENCY, CO-REALTOR NAME, CO-REALTOR CONTACT NO and CO-REALTOR AGENCY

House page

Prerequisites

Before starting this project, ensure you have the following:

  • Python Installed: Make sure Python is installed on your machine.
  • Proxy Usage: It is highly recommended to use a proxy for this project to avoid detection and potential blocking. For this tutorial, we will use a residential proxy from Rayobyte. You can sign up for a free trial that offers 50MB of usage without requiring a credit card.

Project Setup

  1. Create a new folder in your desired directory to house your project files.
  2. Open your terminal in the directory you just created and run the following command to install the necessary libraries:
pip install requests beautifulsoup4

3. If you are using a proxy, I suggest installing the python-dotenv package as well, so you can store your credentials in a .env file:

pip install python-dotenv
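
Your .env file then needs just one line holding the proxy credentials (placeholder values shown; use the format your provider gives you):

# .env
PROXY=username:password@host:port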

4. Open your preferred code editor (for example, Visual Studio Code) and create a new file with the extension .ipynb. This will create a new Jupyter notebook within VS Code.

[PART 1] Scraping Zillow Data from the search page

  • House URL, Images, Price, Address, Number of bedroom(s), Number of bathroom(s), House Size, Lot Size and House Type

In this section, we will implement the code to scrape property data from Zillow. We will cover everything from importing libraries to saving the extracted information in a CSV file.

First, we need to import the libraries that will help us with HTTP requests and HTML parsing.

import requests
from bs4 import BeautifulSoup
import json

Setting headers helps disguise our request as if it’s coming from a real browser, which can help avoid detection.

headers = {
    "User-Agent": Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36,
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

If you have a proxy, include it in your requests to avoid potential blocks.

proxies = {
    'http': 'http://username:password@host:port',
    'https': 'http://username:password@host:port'
}

Make sure to replace username, password, host, and port with your actual proxy credentials.

Or you can create a .env file to store your proxy credentials and load your proxies like this:

import os
from dotenv import load_dotenv

load_dotenv()

proxy = os.getenv("PROXY")
PROXIES = {
    'http': f'http://{proxy}',
    'https': f'http://{proxy}'
}

Define the URL for the state you want to scrape—in this case, Nebraska.

url = "https://www.zillow.com/ne"

Send a GET request to the server using the headers and proxies defined earlier.

response = requests.get(url, headers=headers, proxies=proxies)  # Use proxies if available
# If you don't have a proxy:
# response = requests.get(url, headers=headers)

Use BeautifulSoup to parse the HTML content of the page.

soup = BeautifulSoup(response.content, 'html.parser')
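
Before extracting anything, it's worth a quick sanity check that the request succeeded and that we received a real results page rather than a block page; a minimal check could be:

print(response.status_code)                       # expect 200
print(soup.title.string if soup.title else None)  # should look like a Zillow search page title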

Extract House URLs from Listing Cards

The first thing that we want from the first landing page is to extract all the House URLs. Normally these URLs are available inside the “listing cards”.

Listing card

Inspecting the “Listing Card”

To inspect the element, right-click anywhere and click on "Inspect", or simply press F12. Click on the arrow icon and hover over the element we want.

Hover arrow icon listing card element

listing_card = soup.find_all('li', class_='ListItem-c11n-8-105-0__sc-13rwu5a-0')
print(len(listing_card))

listing len

As we can see, there are 42 listings on this first page.

Now, let's try getting the URL. If we expand the li tag, we will notice an a tag, and the URL is in its href attribute:

House url html tag

To get this value, let's test by extracting it from the first listing only.

card = listing_card[0]
house_url = card.find('a').get('href')
print('URL:', house_url)

house url result

This works fine so far. However, as you may or may not know, Zillow has strong anti-bot detection mechanisms. Using this method, you'll only get a response for 10 URLs instead of the 42 listings that appear on the first page.

Overcome Anti-Bot Detection by Extracting the data from JSON format

To overcome this issue, I found another approach: using the JavaScript-rendered data that the page returns. If we scroll down in the "Inspect" panel, we will find a script tag with the id "__NEXT_DATA__":

content = soup.find('script', id='__NEXT_DATA__')

Convert the content to json format.

import json
json_content = content.string
data = json.loads(json_content)

Save this JSON data for easier inspection later:

with open('output.json', 'w') as json_file:
    json.dump(data, json_file, indent=4)

After running this code, you’ll get the output.json file inside your folder. 

Open the file to locate the URL. I’m using ctrl+f to find the URL location inside my VScode.

json output

Notice here the URL is inside the "detailUrl". Apart from that, it returns other useful information as well. 

To extract the value inside this json file:

house_details = data['props']['pageProps']['searchPageState']['cat1']['searchResults']['listResults']

Get the first listing

detail = house_details[0]
house_url = detail['detailUrl']
house_url

We get the same value as before.

To get all the URLs from the first page:

house_urls = [detail['detailUrl'] for detail in house_details]

By using this method, we are able to get all the URLs.

all house url

As we inspect our JSON file, we can see the other information we're interested in as well, so let's extract those values from here.
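
If you prefer exploring the available fields in code rather than scrolling through output.json, you can print the keys of a single listing (the exact set of keys can vary from listing to listing):

detail = house_details[0]
print(list(detail.keys()))                        # e.g. 'detailUrl', 'price', 'address', 'hdpData', ...
print(list(detail['hdpData']['homeInfo'].keys()))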

Image URLs

photo_urls = [photo['url'] for photo in detail['carouselPhotos']]

Photos URLs

price = detail['price']
full_address = detail['address']
address_street = detail['addressStreet']
city = detail['addressCity']
state = detail['addressState']
zipcode = detail['addressZipcode']
home_info = detail['hdpData']['homeInfo']
bedrooms = home_info['bedrooms']
bathrooms = home_info['bathrooms']
house_size = home_info['livingArea']
lot_size = home_info['lotAreaValue']
house_type = home_info['homeType']

Save all the information in a CSV file

import csv

# Open a new CSV file for writing
with open('house_details.csv', 'w', newline='', encoding='utf-8') as csvfile:
    # Create a CSV writer object
    csvwriter = csv.writer(csvfile)
   
    # Write the header row
    csvwriter.writerow(['HOUSE URL', 'PHOTO URLs', 'PRICE', 'FULL ADDRESS', 'STREET', 'CITY', 'STATE', 'ZIP CODE',
                        'NUMBER OF BEDROOMS', 'NUMBER OF BATHROOM', 'HOUSE SIZE', 'LOT SIZE', 'HOUSE TYPE'])
   
    # Iterate through the house details and write each row
    for detail in house_details:
        house_url = detail['detailUrl']
        photo_urls = ','.join([photo['url'] for photo in detail['carouselPhotos']])
        price = detail['price']
        full_address = detail['address']
        address_street = detail['addressStreet']
        city = detail['addressCity']
        state = detail['addressState']
        zipcode = detail['addressZipcode']
        home_info = detail['hdpData']['homeInfo']
        bedrooms = home_info['bedrooms']
        bathrooms = home_info['bathrooms']
        house_size = home_info['livingArea']
        lot_size = home_info['lotAreaValue']
        lot_unit = home_info['lotAreaUnit']
        house_type = home_info['homeType']
       
        # Write the row to the CSV file
        csvwriter.writerow([house_url, photo_urls, price, full_address, address_street, city, state, zipcode, bedrooms, bathrooms, house_size, f'{lot_size} {lot_unit}', house_type])

print("Data has been saved to house_details.csv")

This is all the output from the first page, 41 rows in total.

Complete code for the first page

import os
import requests
from bs4 import BeautifulSoup
import json
import csv
from dotenv import load_dotenv

load_dotenv()

# Define headers for the HTTP request
HEADERS = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

# Define proxy settings (if needed)
proxy = os.getenv("PROXY")

PROXIES = {
    'http': f'http://{proxy}',
    'https': f'http://{proxy}'
}


def fetch_data(url):
    try:
        response = requests.get(url, headers=HEADERS, proxies=PROXIES)
        response.raise_for_status()
        return response.content
    except requests.RequestException as e:
        print(f"Error fetching data: {e}")
        return None


def parse_data(content):
    soup = BeautifulSoup(content, 'html.parser')
    script_content = soup.find('script', id='__NEXT_DATA__')

    if script_content:
        json_content = script_content.string
        return json.loads(json_content)
    else:
        print("Could not find the required script tag.")
        return None


def save_to_csv(house_details, output_file):
    with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
        csvwriter = csv.writer(csvfile)

        csvwriter.writerow([
            'HOUSE URL', 'PHOTO URLs', 'PRICE', 'FULL ADDRESS',
            'STREET', 'CITY', 'STATE', 'ZIP CODE',
            'NUMBER OF BEDROOMS', 'NUMBER OF BATHROOMS',
            'HOUSE SIZE', 'LOT SIZE', 'HOUSE TYPE'
        ])

        for detail in house_details:
            home_info = detail['hdpData']['homeInfo']
            photo_urls = ','.join([photo['url']
                                  for photo in detail['carouselPhotos']])

            # Concatenate lot area value and unit
            lot_size = f"{home_info.get('lotAreaValue')} {home_info.get('lotAreaUnit')}"

            csvwriter.writerow([
                detail['detailUrl'],
                photo_urls,
                detail['price'],
                detail['address'],
                detail['addressStreet'],
                detail['addressCity'],
                detail['addressState'],
                detail['addressZipcode'],
                home_info.get('bedrooms'),
                home_info.get('bathrooms'),
                home_info.get('livingArea'),
                lot_size,
                home_info.get('homeType').replace('_', ' ')
            ])


def main():
    URL = "https://www.zillow.com/ne"
    content = fetch_data(URL)

    output_directory = 'OUTPUT_1'
    os.makedirs(output_directory, exist_ok=True)
    file_name = 'house_details_first_page.csv'
    output_file = os.path.join(output_directory, file_name)

    if content:
        data = parse_data(content)
        if data:
            house_details = data['props']['pageProps']['searchPageState']['cat1']['searchResults']['listResults']
            save_to_csv(house_details, output_file)
            print(f"Data has been saved to {output_file}")


if __name__ == "__main__":
    main()

After running this code, it will create a new folder named OUTPUT_1 and you’ll find the file name house_details_first_page.csv inside it.

Get the information from the next page

First, take a look at the URLs for the pages we want to scrape:

  • First Page: https://www.zillow.com/ne 
  • Second Page: https://www.zillow.com/ne/2_p
  • Third Page: https://www.zillow.com/ne/3_p 

Notice how the page number increments by 1 with each subsequent page.

To automate the scraping process, we will utilize a while loop that iterates through all the pages. Here’s how we can set it up:

base_url = "https://www.zillow.com/ne"
page = 1
max_pages = 10  # Adjust this to scrape more pages, or set to None for all pages

while max_pages is None or page <= max_pages:
    if page == 1:
        url = base_url
    else:
        url = f"{base_url}/{page}_p"

    # ...fetch, parse, and save the data for this page (see the complete code below)...
    page += 1

Complete code for all pages

Below is the complete code that scrapes all specified pages. We will also use tqdm to monitor our scraping progress. To install tqdm, run:

pip install tqdm

Additionally, we’ll implement logging to capture any errors during execution. A log file named scraper.log will be created to store these logs.

Important Notes

  • The current setup limits scraping to 5 pages. To extract data from all available pages, simply change the max_pages variable in main() to None.
  • Don’t forget to update your proxy credentials as necessary.
import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import time
import logging
from tqdm import tqdm
from dotenv import load_dotenv

load_dotenv()

# Set up logging
logging.basicConfig(filename='scraper.log', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

# Define headers for the HTTP request
HEADERS = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

# Define proxy settings (if needed)
proxy = os.getenv("PROXY")

PROXIES = {
    'http': f'http://{proxy}',
    'https': f'http://{proxy}'
}


def fetch_data(url):
    try:
        response = requests.get(url, headers=HEADERS, proxies=PROXIES)
        response.raise_for_status()
        return response.content
    except requests.RequestException as e:
        logging.error(f"Error fetching data: {e}")
        return None


def parse_data(content):
    try:
        soup = BeautifulSoup(content, 'html.parser')
        script_content = soup.find('script', id='__NEXT_DATA__')

        if script_content:
            json_content = script_content.string
            return json.loads(json_content)
        else:
            logging.error("Could not find the required script tag.")
            return None
    except json.JSONDecodeError as e:
        logging.error(f"Error parsing JSON: {e}")
        return None


def save_to_csv(house_details, output_file, mode='a'):
    with open(output_file, mode, newline='', encoding='utf-8') as csvfile:
        csvwriter = csv.writer(csvfile)

        if mode == 'w':
            csvwriter.writerow([
                'HOUSE URL', 'PHOTO URLs', 'PRICE', 'FULL ADDRESS',
                'STREET', 'CITY', 'STATE', 'ZIP CODE',
                'NUMBER OF BEDROOMS', 'NUMBER OF BATHROOMS',
                'HOUSE SIZE', 'LOT SIZE', 'HOUSE TYPE'
            ])

        for detail in tqdm(house_details, desc="Saving house details", unit="house"):
            try:
                home_info = detail.get('hdpData', {}).get('homeInfo', {})
                photo_urls = ','.join([photo.get('url', '')
                                      for photo in detail.get('carouselPhotos', [])])

                # Concatenate lot area value and unit
                lot_size = f"{home_info.get('lotAreaValue', '')} {home_info.get('lotAreaUnit', '')}"

                csvwriter.writerow([
                    detail.get('detailUrl', ''),
                    photo_urls,
                    detail.get('price', ''),
                    detail.get('address', ''),
                    detail.get('addressStreet', ''),
                    detail.get('addressCity', ''),
                    detail.get('addressState', ''),
                    detail.get('addressZipcode', ''),
                    home_info.get('bedrooms', ''),
                    home_info.get('bathrooms', ''),
                    home_info.get('livingArea', ''),
                    lot_size,
                    home_info.get('homeType', '').replace('_', ' ')
                ])
            except Exception as e:
                logging.error(f"Error processing house detail: {e}")
                logging.error(f"Problematic detail: {detail}")


def main():
    base_url = "https://www.zillow.com/ne"
    page = 1
    max_pages = 5  # Set this to the number of pages you want to scrape, or None for all pages

    output_directory = 'OUTPUT_1'
    os.makedirs(output_directory, exist_ok=True)
    file_name = f'house_details-1-{max_pages}.csv'
    output_file = os.path.join(output_directory, file_name)

    with tqdm(total=max_pages, desc="Scraping pages", unit="page") as pbar:
        while max_pages is None or page <= max_pages:
            if page == 1:
                url = base_url
            else:
                url = f"{base_url}/{page}_p"

            logging.info(f"Scraping page {page}: {url}")
            content = fetch_data(url)

            if content:
                data = parse_data(content)
                if data:
                    try:
                        house_details = data['props']['pageProps']['searchPageState']['cat1']['searchResults']['listResults']
                        if house_details:
                            save_to_csv(house_details, output_file,
                                        mode='a' if page > 1 else 'w')
                            logging.info(
                                f"Data from page {page} has been saved to {output_file}")
                        else:
                            logging.info(
                                f"No more results found on page {page}. Stopping.")
                            break
                    except KeyError as e:
                        logging.error(f"KeyError on page {page}: {e}")
                        logging.error(f"Data structure: {data}")
                        break
                else:
                    logging.error(
                        f"Failed to parse data from page {page}. Stopping.")
                    break
            else:
                logging.error(
                    f"Failed to fetch data from page {page}. Stopping.")
                break

            page += 1
            pbar.update(1)
            # Add a delay between requests to be respectful to the server
            time.sleep(5)

    logging.info("Scraping completed.")


if __name__ == "__main__":
    main()

[PART 2] Scrape the other information from the Properties page

  • Year Built, Description, Listing Date, Days on Zillow, Total Views, Total Saved, Realtor Name, Realtor Contact Number, Agency, Co-realtor Name, Co-realtor contact number, Co-realtor agency

To extract additional information from a Zillow property listing that is not available directly on the search results page, we need to send a GET request to the specific HOUSE URL. This will allow us to gather details such as the year built, description, listing updated date, realtor information, number of views, and number of saves.

First, we will define the HOUSE URL from which we want to extract the additional information. This URL may vary depending on the specific property you are scraping.

house_url = 'https://www.zillow.com/homedetails/7017-S-132nd-Ave-Omaha-NE-68138/58586050_zpid/'

response = requests.get(house_url, headers=HEADERS, proxies=PROXIES)
soup = BeautifulSoup(response.content, 'html.parser')

Since we already have the image URLs, we will focus on this container, which holds the relevant data for extraction.

content container

content = soup.find('div', class_='ds-data-view-list')

Now let’s extract the Year Built:

year built element

Since there are a few other elements with the same span tag and class name, we’re going to be more specific by finding the element with the text “Built in”

year = content.find('span', class_='Text-c11n-8-100-2__sc-aiai24-0', string=lambda text: text and "Built in" in text)
year_built = year.text.strip().replace('Built in ', '')
year_built

year built result

The property description can be found within a specific div tag identified by its data-testid.

description elements

description = content.find('div', attrs={'data-testid': 'description'}).text.strip()
description

Description output

Notice that at the end of the output there is a 'Show more' string, so let's remove it by replacing it with an empty string.

description = content.find('div', attrs={'data-testid': 'description'}).text.strip().replace('Show more','')

Get the listing date:

listing date element

Similar to extracting the year built, we will find the listing updated date using a specific class name and filtering for relevant text.

listing_details = content.find_all('p', class_='Text-c11n-8-100-2__sc-aiai24-0', string=lambda text: text and "Listing updated" in text)
date_details = listing_details[0].text.strip()
date_part = date_details.split(' at ')[0]
listing_date = date_part.replace('Listing updated: ', '').strip()

listing date output

Get the days on Zillow, total views and total saved

dt tag

These values can be found within dt tags. We will extract them based on their positions.

containers = content.find_all('dt')

dt container

days_on_zillow = containers[0].text.strip()
views = containers[2].text.strip()
total_save = containers[4].text.strip()

dt output

Finally, we will extract information about the realtor and their agency from specific p tags.

realtor element tag realtor container

If we expand the p tag, we can see the values that we want inside it.

realtor_content = content.find('p', attrs={'data-testid': 'attribution-LISTING_AGENT'}).text.strip().replace(',', '')
print('REALTOR:', realtor_content)

realtor output details

As we see from the output above, the realtor’s name and contact number are inside the same ‘element’ so let’s separate them to make our data look nice and clean.

name, contact = realtor_content.split('M:')
realtor_name = name.strip()
realtor_contact = contact.strip()
print('REALTOR NAME:', realtor_name)
print('REALTOR CONTACT NO:', realtor_contact)

realtor output seperate

agency_name = content.find('p', attrs={'data-testid': 'attribution-BROKER'}).text.strip().replace(',', '')
print('OFFICE:', agency_name)

agency name

co_realtor_content = content.find('p', attrs={'data-testid': 'attribution-CO_LISTING_AGENT'}).text.strip().replace(',', '')
print('CO-REALTOR CONTENT:', co_realtor_content)

Same as before, we need to split the name and contact number.

name_contact = co_realtor_content.rsplit(' ', 1)
name = name_contact[0]
contact = name_contact[1]
co_realtor_name = name.strip()
co_realtor_contact = contact.strip()
print(f"CO-REALTOR NAME: {co_realtor_name}")
print(f"CO-REALTOR CONTACT NO: {co_realtor_contact}")

co-realtor separate output

co_realtor_agency_name = content.find('p', attrs={'data-testid': 'attribution-CO_LISTING_AGENT_OFFICE'}).text.strip()
print('CO-REALTOR AGENCY NAME:', co_realtor_agency_name)

co-realtor agency

Complete code with the additional data

Let’s enhance our data collection process by creating a new Python file dedicated to fetching additional information. This script will first read the HOUSE URLs from the existing CSV file, sending requests for each URL to extract valuable data. Once all information is gathered, it will save the results in a new CSV file, preserving the original data for reference.
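
Below is a minimal sketch of that script, stitched together from the Part 2 snippets above and the proxy setup from Part 1; it is not the only way to structure it. The file names used here (house_details-1-5.csv as input, OUTPUT_2/house_details_full.csv as output) are assumptions, so point them at whatever your Part 1 run actually produced. It also relies on pandas (pip install pandas) to read the existing CSV and to write the combined result.

import os
import random
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from tqdm import tqdm

load_dotenv()

HEADERS = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}

proxy = os.getenv("PROXY")
PROXIES = {'http': f'http://{proxy}', 'https': f'http://{proxy}'}

def split_agent(content, testid):
    # Return (name, contact) from an attribution paragraph, or ("N/A", "N/A") if it is missing
    elem = content.find('p', attrs={'data-testid': testid})
    if not elem:
        return "N/A", "N/A"
    text = elem.text.strip().replace(',', '')
    if 'M:' in text:
        name, contact = text.split('M:')
    else:
        name, _, contact = text.rpartition(' ')
    return name.strip(), contact.strip()

def scrape_house_data(house_url):
    # Fetch one property page and extract the additional fields covered above
    response = requests.get(house_url, headers=HEADERS, proxies=PROXIES, timeout=30)
    response.raise_for_status()
    content = BeautifulSoup(response.content, 'html.parser').find('div', class_='ds-data-view-list')
    if not content:
        return None

    year = content.find('span', class_='Text-c11n-8-100-2__sc-aiai24-0',
                        string=lambda text: text and "Built in" in text)
    description_elem = content.find('div', attrs={'data-testid': 'description'})

    listing_date = "N/A"
    listing_details = content.find_all('p', class_='Text-c11n-8-100-2__sc-aiai24-0',
                                       string=lambda text: text and "Listing updated" in text)
    if listing_details:
        date_part = listing_details[0].text.strip().split(' at ')[0]
        listing_date = date_part.replace('Listing updated: ', '').strip()

    containers = content.find_all('dt')
    realtor_name, realtor_contact = split_agent(content, 'attribution-LISTING_AGENT')
    co_realtor_name, co_realtor_contact = split_agent(content, 'attribution-CO_LISTING_AGENT')
    agency_elem = content.find('p', attrs={'data-testid': 'attribution-BROKER'})
    co_agency_elem = content.find('p', attrs={'data-testid': 'attribution-CO_LISTING_AGENT_OFFICE'})

    return {
        'YEAR BUILT': year.text.strip().replace('Built in ', '') if year else "N/A",
        'DESCRIPTION': description_elem.text.strip().replace('Show more', '') if description_elem else "N/A",
        'LISTING DATE': listing_date,
        'DAYS ON ZILLOW': containers[0].text.strip() if len(containers) > 0 else "N/A",
        'TOTAL VIEWS': containers[2].text.strip() if len(containers) > 2 else "N/A",
        'TOTAL SAVED': containers[4].text.strip() if len(containers) > 4 else "N/A",
        'REALTOR NAME': realtor_name,
        'REALTOR CONTACT NO': realtor_contact,
        'AGENCY': agency_elem.text.strip().replace(',', '') if agency_elem else "N/A",
        'CO-REALTOR NAME': co_realtor_name,
        'CO-REALTOR CONTACT NO': co_realtor_contact,
        'CO-REALTOR AGENCY': co_agency_elem.text.strip() if co_agency_elem else "N/A",
    }

def main():
    input_file = './OUTPUT_1/house_details-1-5.csv'    # the CSV produced in Part 1 (adjust the name)
    output_file = './OUTPUT_2/house_details_full.csv'  # write to a new file so the original data is preserved
    os.makedirs('OUTPUT_2', exist_ok=True)

    df = pd.read_csv(input_file)
    rows = []
    for _, row in tqdm(df.iterrows(), total=df.shape[0], desc="Scraping houses", unit="house"):
        try:
            extra = scrape_house_data(row['HOUSE URL'])
        except requests.RequestException as e:
            print(f"Request failed for {row['HOUSE URL']}: {e}")
            extra = None
        if extra:
            rows.append({**row.to_dict(), **extra})
        time.sleep(random.uniform(1, 5))  # be polite between requests

    pd.DataFrame(rows).to_csv(output_file, index=False)
    print(f"Done. Combined data saved to {output_file}")

if __name__ == "__main__":
    main()

For a more robust version that adds retries, saves progress after every row, and rotates proxies, see the proxy rotation section below.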


Why Create a New File?

The decision to generate a new file instead of overwriting the previous one serves as a safeguard. This approach ensures that we have a backup in case our code encounters issues or if access is blocked, allowing us to maintain data integrity throughout the process.

By implementing this strategy, we not only enhance our data collection capabilities but also ensure that we can troubleshoot effectively without losing any valuable information.


Complete code for the additional data with Proxy Rotation

Implementing proxy rotation is essential for avoiding anti-bot detection, especially when making numerous requests to a website. In this tutorial, we will demonstrate how to gather additional data from Zillow property listings while utilizing proxies from Rayobyte, which offers 50MB of residential proxy traffic for free upon signup.

Download and Prepare the Proxy List

Sign Up for Rayobyte: Create an account on Rayobyte to access their proxy services.

Generate Proxy List:

  • Navigate to the “Proxy List Generator” in your dashboard.
  • Set the format to username:password@hostname:port.
  • Download the proxy list.

Move the Proxy File: Locate the downloaded file in your downloads directory and move it to your code directory.
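
Each line of the downloaded proxy-list.txt file should look something like this (placeholder values shown, not real credentials):

username1:password1@proxy-host-1.example.com:8000
username2:password2@proxy-host-2.example.com:8000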

rayobyte dashboard

Implement Proxy Rotation in Your Code

import os
import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
import pandas as pd
import random
import time
import logging
from tqdm import tqdm

load_dotenv()

# Set up logging
logging.basicConfig(filename='scraper.log', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S')

HEADERS = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}


def load_proxies(file_path):
    with open(file_path, 'r') as f:
        return [line.strip() for line in f if line.strip()]


PROXY_LIST = load_proxies('proxy-list.txt')


def get_random_proxy():
    return random.choice(PROXY_LIST)


def get_proxies(proxy):
    return {
        'http': f'http://{proxy}',
        'https': f'http://{proxy}'
    }


def scrape_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        proxy = get_random_proxy()
        proxies = get_proxies(proxy)
        try:
            response = requests.get(
                url, headers=HEADERS, proxies=proxies, timeout=30)
            if response.status_code == 200:
                return response
            else:
                logging.warning(
                    f"Attempt {attempt + 1} failed with status code {response.status_code} for URL: {url}")
        except requests.RequestException as e:
            logging.error(
                f"Attempt {attempt + 1} failed with error: {e} for URL: {url}")

        time.sleep(random.uniform(1, 3))

    logging.error(
        f"Failed to fetch data for {url} after {max_retries} attempts.")
    return None


def scrape_house_data(house_url):
    response = scrape_with_retry(house_url)
    if not response:
        return None

    soup = BeautifulSoup(response.content, 'html.parser')
    content = soup.find('div', class_='ds-data-view-list')

    if not content:
        logging.error(f"Failed to find content for {house_url}")
        return None

    year = content.find('span', class_='Text-c11n-8-100-2__sc-aiai24-0',
                        string=lambda text: text and "Built in" in text)
    year_built = year.text.strip().replace('Built in ', '') if year else "N/A"

    description_elem = content.find(
        'div', attrs={'data-testid': 'description'})
    description = description_elem.text.strip().replace(
        'Show more', '') if description_elem else "N/A"

    listing_details = content.find_all(
        'p', class_='Text-c11n-8-100-2__sc-aiai24-0', string=lambda text: text and "Listing updated" in text)
    listing_date = "N/A"
    if listing_details:
        date_details = listing_details[0].text.strip()
        date_part = date_details.split(' at ')[0]
        listing_date = date_part.replace('Listing updated: ', '').strip()

    containers = content.find_all('dt')
    days_on_zillow = containers[0].text.strip() if len(
        containers) > 0 else "N/A"
    views = containers[2].text.strip() if len(containers) > 2 else "N/A"
    total_save = containers[4].text.strip() if len(containers) > 4 else "N/A"

    realtor_elem = content.find(
        'p', attrs={'data-testid': 'attribution-LISTING_AGENT'})
    if realtor_elem:
        realtor_content = realtor_elem.text.strip().replace(',', '')
        if 'M:' in realtor_content:
            name, contact = realtor_content.split('M:')
        else:
            name_contact = realtor_content.rsplit(' ', 1)
            name = name_contact[0]
            contact = name_contact[1]

        realtor_name = name.strip()
        realtor_contact = contact.strip()

    else:
        realtor_name = "N/A"
        realtor_contact = "N/A"

    agency_elem = content.find(
        'p', attrs={'data-testid': 'attribution-BROKER'})
    agency_name = agency_elem.text.strip().replace(',', '') if agency_elem else "N/A"

    co_realtor_elem = content.find(
        'p', attrs={'data-testid': 'attribution-CO_LISTING_AGENT'})
    if co_realtor_elem:
        co_realtor_content = co_realtor_elem.text.strip().replace(',', '')
        if 'M:' in co_realtor_content:
            name, contact = co_realtor_content.split('M:')
        else:
            name_contact = co_realtor_content.rsplit(' ', 1)
            name = name_contact[0]
            contact = name_contact[1]

        co_realtor_name = name.strip()
        co_realtor_contact = contact.strip()

    else:
        co_realtor_name = "N/A"
        co_realtor_contact = "N/A"

    co_realtor_agency_elem = content.find(
        'p', attrs={'data-testid': 'attribution-CO_LISTING_AGENT_OFFICE'})
    co_realtor_agency_name = co_realtor_agency_elem.text.strip(
    ) if co_realtor_agency_elem else "N/A"

    return {
        'YEAR BUILT': year_built,
        'DESCRIPTION': description,
        'LISTING DATE': listing_date,
        'DAYS ON ZILLOW': days_on_zillow,
        'TOTAL VIEWS': views,
        'TOTAL SAVED': total_save,
        'REALTOR NAME': realtor_name,
        'REALTOR CONTACT NO': realtor_contact,
        'AGENCY': agency_name,
        'CO-REALTOR NAME': co_realtor_name,
        'CO-REALTOR CONTACT NO': co_realtor_contact,
        'CO-REALTOR AGENCY': co_realtor_agency_name
    }


def ensure_output_directory(directory):
    if not os.path.exists(directory):
        os.makedirs(directory)
        logging.info(f"Created output directory: {directory}")


def load_progress(output_file):
    if os.path.exists(output_file):
        return pd.read_csv(output_file)
    return pd.DataFrame()


def save_progress(df, output_file):
    df.to_csv(output_file, index=False)
    logging.info(f"Progress saved to {output_file}")


def main():
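    # Input CSV from Part 1 -- adjust the file name to match what you generated (e.g. house_details-1-5.csv)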
    input_file = './OUTPUT_1/house_details.csv'

    output_directory = 'OUTPUT_2'
    file_name = 'house_details_scraped.csv'
    output_file = os.path.join(output_directory, file_name)
    ensure_output_directory(output_directory)

    df = pd.read_csv(input_file)

    # Load existing progress
    result_df = load_progress(output_file)

    # Determine which URLs have already been scraped
    scraped_urls = set(result_df['HOUSE URL']
                      ) if 'HOUSE URL' in result_df.columns else set()

    # Scrape data for each house URL
    for _, row in tqdm(df.iterrows(), total=df.shape[0], desc="Scraping Progress"):
        house_url = row['HOUSE URL']

        # Skip if already scraped
        if house_url in scraped_urls:
            continue

        logging.info(f"Scraping data for {house_url}")
        data = scrape_house_data(house_url)

        if data:
            # Combine the original row data with the scraped data
            combined_data = {**row.to_dict(), **data}
            new_row = pd.DataFrame([combined_data])

            # Append the new row to the result DataFrame
            result_df = pd.concat([result_df, new_row], ignore_index=True)

            # Save progress after each successful scrape
            save_progress(result_df, output_file)

        # Add a random delay between requests (1 to 5 seconds)
        time.sleep(random.uniform(1, 5))

    logging.info(f"Scraping completed. Final results saved to {output_file}")
    print(
        f"Scraping completed. Check {output_file} for results and scraper.log for detailed logs.")


if __name__ == "__main__":
    main()

Conclusion

In conclusion, this comprehensive guide on Zillow scraping with Python has equipped you with essential tools and techniques to effectively extract property listings and home prices. By following the outlined steps, you have learned how to navigate the complexities of web scraping, including overcoming anti-bot measures and utilizing proxies for seamless data retrieval.

Key takeaways from this tutorial include:

  • Understanding the Ethical Considerations: Emphasizing responsible scraping practices to respect website performance and legal guidelines.
  • Scraping Workflow: Dividing the scraping process into manageable parts for clarity and efficiency.
  • Technical Implementation: Utilizing Python libraries such as requests, BeautifulSoup, and json for data extraction.
  • Data Storage: Saving extracted information in CSV format for easy access and analysis.

As you implement these strategies, you will gain valuable insights into real estate trends and market dynamics, empowering you to make informed decisions based on the data collected. With the provided source code and detailed explanations, you are now well-prepared to adapt this project to your specific needs, whether that involves expanding your data collection or refining your analysis techniques. Embrace the power of data-driven insights as you explore the vast landscape of real estate information available through platforms like Zillow. Drop a comment below if you have any questions and Happy scraping!

Source code: zillow_properties_for_sale_scraper

Video: Extract data from Zillow properties for sale listing using Python
