Zillow Scraping with Python: Extract Property Listings and Home Prices

Source code: zillow_properties_for_sale_scraper 

Table of Contents

Introduction
Ethical Consideration
Scraping Workflow
Prerequisites
Project Setup
[PART 1] Scraping Zillow Data from the search page
Complete code for the first page
Get the information from the next page
Complete code for all pages
[PART 2] Scrape the other information from the Properties page
Complete code for the additional data
Complete code with the additional data with Proxy Rotation
Conclusion

Introduction

Zillow is a go-to platform for real estate data, featuring millions of property listings with detailed information on prices, locations, and home features. In this tutorial, we’ll guide you through the process of scraping Zillow using Python. You will learn how to extract essential property details such as home prices and geographic data, enabling you to track market trends, analyze property values, and compare listings across various regions. This guide includes source code and techniques to implement Zillow scraping effectively.

In this tutorial, we will focus on collecting data for houses listed for sale in Nebraska. Our starting URL will be https://www.zillow.com/ne.

The information that we want to scrape is:

  • House URL
  • Images
  • Price
  • Address
  • Number of bedroom(s)
  • Number of bathroom(s)
  • House Size
  • Lot Size
  • House Type
  • Year Built
  • Description
  • Listing Date
  • Days on Zillow
  • Total Views
  • Total Saved
  • Realtor Name
  • Realtor Contact Number
  • Agency
  • Co-realtor Name
  • Co-realtor contact number
  • Co-realtor agency

Ethical Consideration

Before we dive into the technical aspects of scraping Zillow, it’s important to emphasize that this tutorial is intended for educational purposes only. When interacting with public servers, it’s vital to maintain a responsible approach. Here are some essential guidelines to keep in mind:

  • Respect Website Performance: Avoid scraping at a speed that could negatively impact the website’s performance or availability.
  • Public Data Only: Ensure that you only scrape data that is publicly accessible. Respect any restrictions set by the website.
  • No Redistribution of Data: Refrain from redistributing entire public datasets, as this may violate legal regulations in certain jurisdictions.

Scraping Workflow

The Zillow scraper can be effectively divided into two parts, each focusing on different aspects of data extraction.

The first part involves extracting essential information from the Zillow search results page, which includes:

HOUSE URLs, PHOTO URLs, PRICE, FULL ADDRESS, STREET, CITY, STATE, ZIP CODE, NUMBER OF BEDROOMS, NUMBER OF BATHROOMS, HOUSE SIZE, LOT SIZE and HOUSE TYPE

search page

It is important to note that while the search page provides a wealth of information, it does not display LOT SIZE and HOUSE TYPE directly. However, these values are accessible through the page’s backend data, as I’ll show you later.

The second part is to scrape the rest of the information from the particular HOUSE URLs page which includes:

YEAR BUILT, DESCRIPTION, LISTING DATE, DAYS ON ZILLOW, TOTAL VIEWS, TOTAL SAVED, REALTOR NAME, REALTOR CONTACT NO, AGENCY, CO-REALTOR NAME, CO-REALTOR CONTACT NO and CO-REALTOR AGENCY

House page

Prerequisites

Before starting this project, ensure you have the following:

  • Python Installed: Make sure Python is installed on your machine.
  • Proxy Usage: It is highly recommended to use a proxy for this project to avoid detection and potential blocking. For this tutorial, we will use a residential proxy from Rayobyte. You can sign up for a free trial that offers 50MB of usage without requiring a credit card.

Project Setup

  1. Create a new folder in your desired directory to house your project files.
  2. Open your terminal in the directory you just created and run the following command to install the necessary libraries:
pip install requests beautifulsoup4

3. If you are using a proxy, I suggest installing the python-dotenv package as well so that you can store your credentials in a .env file:

pip install python-dotenv

4. Open your preferred code editor (for example, Visual Studio Code) and create a new file with the extension .ipynb. This will create a new Jupyter notebook within VS Code.

[PART 1] Scraping Zillow Data from the search page

  • House URL, Images, Price, Address, Number of bedroom(s), Number of bathroom(s), House Size, Lot Size and House Type

In this section, we will implement the code to scrape property data from Zillow. We will cover everything from importing libraries to saving the extracted information in a CSV file.

First, we need to import the libraries that will help us with HTTP requests and HTML parsing.

import requests
from bs4 import BeautifulSoup
import json

Setting headers helps disguise our request as if it’s coming from a real browser, which can help avoid detection.

headers = {
    "User-Agent": Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36,
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

If you have a proxy, include it in your requests to avoid potential blocks.

proxies = {
    'http': 'http://username:password@host:port',
    'https': 'http://username:password@host:port'
}

Make sure to replace username, password, host, and port with your actual proxy credentials.
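
Before moving on, it’s worth confirming that the proxy actually works. Here is a quick sanity check (optional, and not part of the final scraper; httpbin.org is just a public echo service used for testing) that requests your apparent IP address through the proxy:

import requests

# Uses the proxies dictionary defined above
try:
    r = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=15)
    print(r.json())  # should print the proxy's IP address, not your own
except requests.RequestException as e:
    print(f"Proxy check failed: {e}")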

Alternatively, you can create a .env file to store your proxy credentials and load your proxies like this:

import os
from dotenv import load_dotenv
load_dotenv()

proxy = os.getenv("PROXY")
PROXIES = {
    'http': f'http://{proxy}',
    'https': f'http://{proxy}'
}
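
For reference, the .env file is just a plain-text file in your project root. Assuming your provider uses the username:password@host:port credential format implied by the code above (adjust it to whatever your provider gives you), a minimal sketch of the file and a quick sanity check looks like this:

# Contents of .env (one line, no quotes), for example:
# PROXY=username:password@host:port

import os
from dotenv import load_dotenv

load_dotenv()
if not os.getenv("PROXY"):
    raise RuntimeError("PROXY is not set - check that the .env file exists and contains a PROXY entry")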

Define the URL for the state you want to scrape—in this case, Nebraska.

url = "https://www.zillow.com/ne"

Send a GET request to the server using the headers and proxies defined earlier.

response = requests.get(url, headers=headers, proxies=proxies)  # Use proxies if available
# If you don't have a proxy:
# response = requests.get(url, headers=headers)

Use BeautifulSoup to parse the HTML content of the page.

soup = BeautifulSoup(response.content, 'html.parser')

Extract House URLs from Listing Cards

The first thing that we want from the first landing page is to extract all the House URLs. Normally these URLs are available inside the “listing cards”.

Listing card

Inspecting the “Listing Card”

To inspect an element, right-click anywhere on the page and click on “Inspect”, or simply press F12. Click on the arrow icon and start hovering over the element that we want.

Hover arrow icon listing card element

listing_card = soup.find_all('li', class_='ListItem-c11n-8-105-0__sc-13rwu5a-0')
print(len(listing_card))

listing len

As we can see here, there are 42 listings on this first page.

Now, let’s try getting the URL. If we expand the li tag, we will notice an a tag, and the URL is in its href attribute:

House url html tag

To get this value, let’s test by extracting it from the first listing only. Therefore, we need to specify that we want the information from the first card.

card = listing_card[0]
house_url = card.find('a').get('href')
print('URL:', house_url)

house url result

This works fine so far. However, as you may or may not know, Zillow has strong anti-bot detection mechanisms. Using this method, you’ll therefore get only 10 URLs instead of 42, the total number of listings that appear on the first page.
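
You can see this for yourself by counting how many of the listing cards actually contain an anchor tag in the raw HTML. This is just a quick diagnostic, not part of the final scraper:

# Only the cards rendered server-side contain an <a> tag in the raw HTML
linked_urls = [card.find('a').get('href') for card in listing_card if card.find('a')]
print(len(linked_urls))  # typically far fewer than len(listing_card)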

Overcome Anti-Bot Detection by Extracting the Data in JSON Format

To overcome this issue, I found another approach that uses the JavaScript-rendered data returned with the web page. If we scroll down in the “Inspect” panel, we will find a script tag with the id ”__NEXT_DATA__”.

content = soup.find('script', id='__NEXT_DATA__')

Convert the content to json format.

import json
json_content = content.string
data = json.loads(json_content)

Save this JSON data for easier inspection later:

with open('output.json', 'w') as json_file:
    json.dump(data, json_file, indent=4)

After running this code, you’ll get the output.json file inside your folder. 

Open the file to locate the URL. I’m using Ctrl+F to find the URL inside VS Code.

json output

Notice that the URL is inside “detailUrl”. Apart from that, it returns other useful information as well.
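
If you prefer not to search the file manually, you can also locate a key programmatically. Here is a small helper (an optional convenience, not part of the final scraper) that walks the nested JSON and prints the path to the first occurrence of a key such as 'detailUrl':

def find_key_path(obj, target, path='data'):
    """Recursively search nested dicts/lists and return the path to the first occurrence of target."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            new_path = f"{path}['{key}']"
            if key == target:
                return new_path
            found = find_key_path(value, target, new_path)
            if found:
                return found
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            found = find_key_path(item, target, f"{path}[{i}]")
            if found:
                return found
    return None

print(find_key_path(data, 'detailUrl'))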

To extract the value inside this json file:

house_details = data['props']['pageProps']['searchPageState']['cat1']['searchResults']['listResults']

Get the first listing

detail = house_details[0]
house_url = detail['detailUrl']
house_url

10 house json url

We get the same value as before.

To get all the URLs from the first page:

house_urls = [detail['detailUrl'] for detail in house_details]

By using this method, we are able to get all the URLs.

all house url

As we inspect our JSON file, we can see other information that we’re interested in as well, so let’s get those values from here too.

Image URLs

photo_urls = [photo['url'] for photo in detail['carouselPhotos']]

Photos URLs

price = detail['price']
full_address = detail['address']
address_street = detail['addressStreet']
city = detail['addressCity']
state = detail['addressState']
zipcode = detail['addressZipcode']
home_info = detail['hdpData']['homeInfo']
bedrooms = home_info['bedrooms']
bathrooms = home_info['bathrooms']
house_size = home_info['livingArea']
lot_size = home_info['lotAreaValue']
house_type = home_info['homeType']

13 part 1 data

Save all the information in a CSV file

import csv

# Open a new CSV file for writing
with open('house_details.csv', 'w', newline='', encoding='utf-8') as csvfile:
    # Create a CSV writer object
    csvwriter = csv.writer(csvfile)
   
    # Write the header row
    csvwriter.writerow(['HOUSE URL', 'PHOTO URLs', 'PRICE', 'FULL ADDRESS', 'STREET', 'CITY', 'STATE', 'ZIP CODE',
                        'NUMBER OF BEDROOMS', 'NUMBER OF BATHROOM', 'HOUSE SIZE', 'LOT SIZE', 'HOUSE TYPE'])
   
    # Iterate through the house details and write each row
    for detail in house_details:
        house_url = detail['detailUrl']
        photo_urls = ','.join([photo['url'] for photo in detail['carouselPhotos']])
        price = detail['price']
        full_address = detail['address']
        address_street = detail['addressStreet']
        city = detail['addressCity']
        state = detail['addressState']
        zipcode = detail['addressZipcode']
        home_info = detail['hdpData']['homeInfo']
        bedrooms = home_info['bedrooms']
        bathrooms = home_info['bathrooms']
        house_size = home_info['livingArea']
        lot_size = home_info['lotAreaValue']
        lot_unit = home_info['lotAreaUnit']
        house_type = home_info['homeType']
       
        # Write the row to the CSV file
        csvwriter.writerow([house_url, photo_urls, price, full_address, address_street, city, state, zipcode, bedrooms, bathrooms, house_size, f'{lot_size} {lot_unit}', house_type])

print("Data has been saved to house_details.csv")

This is all the output from the first page, 41 rows in total.

14 csv output
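
As a quick sanity check, you can read the file back and count the data rows. This assumes house_details.csv was written to your current working directory, as in the code above:

import csv

with open('house_details.csv', newline='', encoding='utf-8') as csvfile:
    rows = list(csv.reader(csvfile))

print(len(rows) - 1)  # subtract 1 for the header row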

Complete code for the first page

import os
import requests
from bs4 import BeautifulSoup
import json
import csv
from dotenv import load_dotenv

load_dotenv()

# Define headers for the HTTP request
HEADERS = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

# Define proxy settings (if needed)
proxy = os.getenv("PROXY")

PROXIES = {
    'http': f'http://{proxy}',
    'https': f'http://{proxy}'
}


def fetch_data(url):
    try:
        response = requests.get(url, headers=HEADERS, proxies=PROXIES)
        response.raise_for_status()
        return response.content
    except requests.RequestException as e:
        print(f"Error fetching data: {e}")
        return None


def parse_data(content):
    soup = BeautifulSoup(content, 'html.parser')
    script_content = soup.find('script', id='__NEXT_DATA__')

    if script_content:
        json_content = script_content.string
        return json.loads(json_content)
    else:
        print("Could not find the required script tag.")
        return None


def save_to_csv(house_details, output_file):
    with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
        csvwriter = csv.writer(csvfile)

        csvwriter.writerow([
            'HOUSE URL', 'PHOTO URLs', 'PRICE', 'FULL ADDRESS',
            'STREET', 'CITY', 'STATE', 'ZIP CODE',
            'NUMBER OF BEDROOMS', 'NUMBER OF BATHROOMS',
            'HOUSE SIZE', 'LOT SIZE', 'HOUSE TYPE'
        ])

        for detail in house_details:
            home_info = detail['hdpData']['homeInfo']
            photo_urls = ','.join([photo['url']
                                  for photo in detail['carouselPhotos']])

            # Concatenate lot area value and unit
            lot_size = f"{home_info.get('lotAreaValue')} {home_info.get('lotAreaUnit')}"

            csvwriter.writerow([
                detail['detailUrl'],
                photo_urls,
                detail['price'],
                detail['address'],
                detail['addressStreet'],
                detail['addressCity'],
                detail['addressState'],
                detail['addressZipcode'],
                home_info.get('bedrooms'),
                home_info.get('bathrooms'),
                home_info.get('livingArea'),
                lot_size,
                home_info.get('homeType').replace('_', ' ')
            ])


def main():
    URL = "https://www.zillow.com/ne"
    content = fetch_data(URL)

    output_directory = 'OUTPUT_1'
    os.makedirs(output_directory, exist_ok=True)
    file_name = 'house_details_first_page.csv'
    output_file = os.path.join(output_directory, file_name)

    if content:
        data = parse_data(content)
        if data:
            house_details = data['props']['pageProps']['searchPageState']['cat1']['searchResults']['listResults']
            save_to_csv(house_details, output_file)
            print(f"Data has been saved to {output_file}")


if __name__ == "__main__":
    main()

After running this code, it will create a new folder named OUTPUT_1, and you’ll find the file house_details_first_page.csv inside it.

Get the information from the next page

First, take a look at the URLs for the pages we want to scrape:

  • First Page: https://www.zillow.com/ne 
  • Second Page: https://www.zillow.com/ne/2_p
  • Third Page: https://www.zillow.com/ne/3_p 

Notice how the page number increments by 1 with each subsequent page.

To automate the scraping process, we will utilize a while loop that iterates through all the pages. Here’s how we can set it up:

base_url = "https://www.zillow.com/ne"
page = 1
max_pages = 10  # Adjust this to scrape more pages, or set to None for all pages

while max_pages is None or page <= max_pages:
    if page == 1:
        url = base_url
    else:
        url = f"{base_url}/{page}_p"

Complete code for all pages

Below is the complete code that scrapes all the specified pages. We will also use tqdm to monitor our scraping progress. To install tqdm, run:

pip install tqdm

Additionally, we’ll implement logging to capture any errors during execution. A log file named scraper.log will be created to store these logs.
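
By default, this configuration writes log messages only to scraper.log. If you also want to see them in the terminal or notebook output while the scraper runs, one optional variant (not required for the scraper to work) is to register both a file handler and a stream handler:

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('scraper.log'),  # keep writing to the log file
        logging.StreamHandler()              # also echo messages to the console
    ]
)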

Important Notes

  • The current setup limits scraping to 5 pages. To extract data from all available pages, simply change max_pages (set near the top of main()) to None.
  • Don’t forget to update your proxy credentials as necessary.
import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import time
import logging
from tqdm import tqdm
from dotenv import load_dotenv

load_dotenv()

# Set up logging
logging.basicConfig(filename='scraper.log', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

# Define headers for the HTTP request
HEADERS = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

# Define proxy settings (if needed)
proxy = os.getenv("PROXY")

PROXIES = {
    'http': f'http://{proxy}',
    'https': f'http://{proxy}'
}


def fetch_data(url):
    try:
        response = requests.get(url, headers=HEADERS, proxies=PROXIES)
        response.raise_for_status()
        return response.content
    except requests.RequestException as e:
        logging.error(f"Error fetching data: {e}")
        return None


def parse_data(content):
    try:
        soup = BeautifulSoup(content, 'html.parser')
        script_content = soup.find('script', id='__NEXT_DATA__')

        if script_content:
            json_content = script_content.string
            return json.loads(json_content)
        else:
            logging.error("Could not find the required script tag.")
            return None
    except json.JSONDecodeError as e:
        logging.error(f"Error parsing JSON: {e}")
        return None


def save_to_csv(house_details, output_file, mode='a'):
    with open(output_file, mode, newline='', encoding='utf-8') as csvfile:
        csvwriter = csv.writer(csvfile)

        if mode == 'w':
            csvwriter.writerow([
                'HOUSE URL', 'PHOTO URLs', 'PRICE', 'FULL ADDRESS',
                'STREET', 'CITY', 'STATE', 'ZIP CODE',
                'NUMBER OF BEDROOMS', 'NUMBER OF BATHROOMS',
                'HOUSE SIZE', 'LOT SIZE', 'HOUSE TYPE'
            ])

        for detail in tqdm(house_details, desc="Saving house details", unit="house"):
            try:
                home_info = detail.get('hdpData', {}).get('homeInfo', {})
                photo_urls = ','.join([photo.get('url', '')
                                      for photo in detail.get('carouselPhotos', [])])

                # Concatenate lot area value and unit
                lot_size = f"{home_info.get('lotAreaValue', '')} {home_info.get('lotAreaUnit', '')}"

                csvwriter.writerow([
                    detail.get('detailUrl', ''),
                    photo_urls,
                    detail.get('price', ''),
                    detail.get('address', ''),
                    detail.get('addressStreet', ''),
                    detail.get('addressCity', ''),
                    detail.get('addressState', ''),
                    detail.get('addressZipcode', ''),
                    home_info.get('bedrooms', ''),
                    home_info.get('bathrooms', ''),
                    home_info.get('livingArea', ''),
                    lot_size,
                    home_info.get('homeType', '').replace('_', ' ')
                ])
            except Exception as e:
                logging.error(f"Error processing house detail: {e}")
                logging.error(f"Problematic detail: {detail}")


def main():
    base_url = "https://www.zillow.com/ne"
    page = 1
    max_pages = 5  # Set this to the number of pages you want to scrape, or None for all pages

    output_directory = 'OUTPUT_1'
    os.makedirs(output_directory, exist_ok=True)
    file_name = f'house_details-1-{max_pages}.csv'
    output_file = os.path.join(output_directory, file_name)

    with tqdm(total=max_pages, desc="Scraping pages", unit="page") as pbar:
        while max_pages is None or page <= max_pages:
            if page == 1:
                url = base_url
            else:
                url = f"{base_url}/{page}_p"

            logging.info(f"Scraping page {page}: {url}")
            content = fetch_data(url)

            if content:
                data = parse_data(content)
                if data:
                    try:
                        house_details = data['props']['pageProps']['searchPageState']['cat1']['searchResults']['listResults']
                        if house_details:
                            save_to_csv(house_details, output_file,
                                        mode='a' if page > 1 else 'w')
                            logging.info(
                                f"Data from page {page} has been saved to {output_file}")
                        else:
                            logging.info(
                                f"No more results found on page {page}. Stopping.")
                            break
                    except KeyError as e:
                        logging.error(f"KeyError on page {page}: {e}")
                        logging.error(f"Data structure: {data}")
                        break
                else:
                    logging.error(
                        f"Failed to parse data from page {page}. Stopping.")
                    break
            else:
                logging.error(
                    f"Failed to fetch data from page {page}. Stopping.")
                break

            page += 1
            pbar.update(1)
            # Add a delay between requests to be respectful to the server
            time.sleep(5)

    logging.info("Scraping completed.")


if __name__ == "__main__":
    main()

[PART 2] Scrape the other information from the Properties page

  • Year Built, Description, Listing Date, Days on Zillow, Total Views, Total Saved, Realtor Name, Realtor Contact Number, Agency, Co-realtor Name, Co-realtor contact number, Co-realtor agency

To extract additional information from a Zillow property listing that is not available directly on the search results page, we need to send a GET request to the specific HOUSE URL. This will allow us to gather details such as the year built, description, listing updated date, realtor information, number of views, and number of saves.

First, we will define the HOUSE URL from which we want to extract the additional information. This URL may vary depending on the specific property you are scraping.

house_url = 'https://www.zillow.com/homedetails/7017-S-132nd-Ave-Omaha-NE-68138/58586050_zpid/'

response = requests.get(house_url, headers=HEADERS, proxies=PROXIES)
soup = BeautifulSoup(response.content, 'html.parser')

Since we already have the image URLs, we will focus on this container, which holds the relevant data for extraction.

content container

content = soup.find('div', class_='ds-data-view-list')

Now let’s extract the Year Built:

year built element

Since there are a few other elements with the same span tag and class name, we’re going to be more specific by finding the element that contains the text “Built in”:

year = content.find('span', class_='Text-c11n-8-100-2__sc-aiai24-0', string=lambda text: text and "Built in" in text)
year_built = year.text.strip().replace('Built in ', '')
year_built

year built result

The property description can be found within a specific div tag identified by its data-testid.

description elements

description = content.find('div', attrs={'data-testid': 'description'}).text.strip()
description

Description output

Notice that at the end of the output there is a ‘Show more’ string, so let’s remove it by replacing it with an empty string.

description = content.find('div', attrs={'data-testid': 'description'}).text.strip().replace('Show more','')

Get the listing date:

listing date element

Similar to extracting the year built, we will find the listing updated date using a specific class name and filtering for relevant text.

listing_details = content.find_all('p', class_='Text-c11n-8-100-2__sc-aiai24-0', string=lambda text: text and "Listing updated" in text)
date_details = listing_details[0].text.strip()
date_part = date_details.split(' at ')[0]
listing_date = date_part.replace('Listing updated: ', '').strip()

listing date output
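
If you would rather work with a real date object than a string, you can parse the result with the standard library. This sketch assumes the date appears in MM/DD/YYYY form (check your own output first, since Zillow may format it differently):

from datetime import datetime

try:
    listing_date_parsed = datetime.strptime(listing_date, '%m/%d/%Y').date()
except ValueError:
    listing_date_parsed = None  # keep the raw string if the format is different
print(listing_date_parsed)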

Get the days on Zillow, total views and total saved

dt tag

These values can be found within dt tags. We will extract them based on their positions.

containers = content.find_all('dt')

dt container

days_on_zillow = containers[0].text.strip()
views = containers[2].text.strip()
total_save = containers[4].text.strip()

dt output

Finally, we will extract information about the realtor and their agency from specific p tags.

realtor element tag realtor container

If we expand the p tag, we can see the values that we want inside it.

realtor_content = content.find('p', attrs={'data-testid': 'attribution-LISTING_AGENT'}).text.strip().replace(',', '')
print('REALTOR:', realtor_content)

realtor output details

As we can see from the output above, the realtor’s name and contact number are inside the same element, so let’s separate them to keep our data clean.

name, contact = realtor_content.split('M:')
realtor_name = name.strip()
realtor_contact = contact.strip()
print('REALTOR NAME:', realtor_name)
print('REALTOR CONTACT NO:', realtor_contact)

realtor output seperate
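
Not every listing is guaranteed to include the agent’s phone number, so if you run this across many properties, a more defensive variant (an assumption on my part, not something verified against every Zillow listing) is to use partition, which never raises when the separator is missing:

name, sep, contact = realtor_content.partition('M:')
realtor_name = name.strip()
realtor_contact = contact.strip() if sep else ''
print('REALTOR NAME:', realtor_name)
print('REALTOR CONTACT NO:', realtor_contact)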

agency_name = content.find('p', attrs={'data-testid': 'attribution-BROKER'}).text.strip().replace(',', '')
print('OFFICE:', agency_name)

agency name

co_realtor_content = content.find('p', attrs={'data-testid': 'attribution-CO_LISTING_AGENT'}).text.strip().replace(',', '')
print('CO-REALTOR CONTENT:', co_realtor_content)

29 co realtor

Same as before, we need to split the name and the contact number.

name_contact = co_realtor_content.rsplit(' ', 1)
name = name_contact[0]
contact = name_contact[1]
co_realtor_name = name.strip()
co_realtor_contact = contact.strip()
print(f"CO-REALTOR NAME: {co_realtor_name}")
print(f"CO-REALTOR CONTACT NO: {co_realtor_contact}")

co-realtor separate output

co_realtor_agency_name = content.find('p', attrs={'data-testid': 'attribution-CO_LISTING_AGENT_OFFICE'}).text.strip()
print('CO-REALTOR AGENCY NAME:', co_realtor_agency_name)

co-realtor agency

Complete code with the additional data

Let’s enhance our data collection process by creating a new Python file dedicated to fetching additional information. This script will first read the HOUSE URLs from the existing CSV file, sending requests for each URL to extract valuable data. Once all information is gathered, it will save the results in a new CSV file, preserving the original data for reference.
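
Before looking at the full script, here is a minimal sketch of the reading step it starts from: loading the HOUSE URL column from the CSV produced in Part 1. It assumes the path and header name used earlier (OUTPUT_1/house_details-1-5.csv and 'HOUSE URL'), so adjust them if your output differs:

import csv

input_file = 'OUTPUT_1/house_details-1-5.csv'  # file produced in Part 1; change if yours is named differently
with open(input_file, newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    house_urls = [row['HOUSE URL'] for row in reader]

print(f"Loaded {len(house_urls)} house URLs to scrape")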

import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import time
import logging
from tqdm import tqdm
from dotenv import load_dotenv

load_dotenv()

# Set up logging
logging.basicConfig(filename='scraper.log', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

# Define headers for the HTTP request
HEADERS = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

# Define proxy settings (if needed)
proxy = os.getenv("PROXY")

PROXIES = {
    'http': f'http://{proxy}',
    'https': f'http://{proxy}'
}


def fetch_data(url):
    try:
        response = requests.get(url, headers=HEADERS, proxies=PROXIES)
        response.raise_for_status()
        return response.content
    except requests.RequestException as e:
        logging.error(f"Error fetching data: {e}")
        return None


def parse_data(content):
    try:
        soup = BeautifulSoup(content, 'html.parser')
        script_content = soup.find('script', id='__NEXT_DATA__')

        if script_content:
            json_content = script_content.string
            return json.loads(json_content)
        else:
            logging.error("Could not find the required script tag.")
            return None
    except json.JSONDecodeError as e:
        logging.error(f"Error parsing JSON: {e}")
        return None


def save_to_csv(house_details, mode='a'):
    output_directory = 'OUTPUT_1'
    os.makedirs(output_directory, exist_ok=True)
    file_name = 'house_details-1-5.csv'  # Change accordingly
    output_file = os.path.join(output_directory, file_name)

    with open(output_file, mode, newline='', encoding='utf-8') as csvfile:
        csvwriter = csv.writer(csvfile)

        if mode == 'w':
            csvwriter.writerow([
                'HOUSE URL', 'PHOTO URLs', 'PRICE', 'FULL ADDRESS',
                'STREET', 'CITY', 'STATE', 'ZIP CODE',
                'NUMBER OF BEDROOMS', 'NUMBER OF BATHROOMS',
                'HOUSE SIZE', 'LOT SIZE', 'HOUSE TYPE'
            ])

        for detail in tqdm(house_details, desc="Saving house details", unit="house"):
            try:
                home_info = detail.get('hdpData', {}).get('homeInfo', {})
                photo_urls = ','.join([photo.get('url', '')
                                      for photo in detail.get('carouselPhotos', [])])

                # Concatenate lot area value and unit
                lot_size = f"{home_info.get('lotAreaValue', '')} {home_info.get('lotAreaUnit', '')}"

                csvwriter.writerow([
                    detail.get('detailUrl', ''),
                    photo_urls,
                    detail.get('price', ''),
                    detail.get('address', ''),
                    detail.get('addressStreet', ''),
                    detail.get('addressCity', ''),
                    detail.get('addressState', ''),
                    detail.get('addressZipcode', ''),
                    home_info.get('bedrooms', ''),
                    home_info.get('bathrooms', ''),
                    home_info.get('livingArea', ''),
                    lot_size,
                    home_info.get('homeType', '').replace('_', ' ')
                ])
            except Exception as e:
                logging.error(f"Error processing house detail: {e}")
                logging.error(f"Problematic detail: {detail}")


def main():
    base_url = "https://www.zillow.com/ne"
    page = 1
    max_pages = 5  # Set this to the number of pages you want to scrape, or None for all pages

    with tqdm(total=max_pages, desc="Scraping pages", unit="page") as pbar:
        while max_pages is None or page <= max_pages:
            if page == 1:
                url = base_url
            else:
                url = f"{base_url}/{page}_p"

            logging.info(f"Scraping page {page}: {url}")
            content = fetch_data(url)

            if content:
                data = parse_data(content)
                if data:
                    try:
                        house_details = data['props']['pageProps']['searchPageState']['cat1']['searchResults']['listResults']
                        if house_details:
                            save_to_csv(house_details,
                                        mode='a' if page > 1 else 'w')
                            logging.info(
                                f"Data from page {page} has been saved to the output CSV")
                        else:
                            logging.info(
                                f"No more results found on page {page}. Stopping.")
                            break
                    except KeyError as e:
                        logging.error(f"KeyError on page {page}: {e}")
                        logging.error(f"Data structure: {data}")
                        break
                else:
                    logging.error(
                        f"Failed to parse data from page {page}. Stopping.")
                    break
            else:
                logging.error(
                    f"Failed to fetch data from page {page}. Stopping.")
                break

            page += 1
            pbar.update(1)
            # Add a delay between requests to be respectful to the server
            time.sleep(5)

    logging.info("Scraping completed.")


if __name__ == "__main__":
    main()

Why Create a New File?

Writing to a new file instead of overwriting the previous one acts as a safeguard. If the code runs into issues or access is blocked, we still have the original CSV as a backup, which keeps the data intact throughout the process.

This way we can troubleshoot effectively without losing any of the information we have already collected.


Complete code for the additional data with Proxy Rotation

Implementing proxy rotation is essential for avoiding anti-bot detection, especially when making numerous requests to a website. In this tutorial, we will demonstrate how to gather additional data from Zillow property listings while utilizing proxies from Rayobyte, which offers 50MB of residential proxy traffic for free upon signup.

Download and Prepare the Proxy List

Sign Up for Rayobyte: Create an account on Rayobyte to access their proxy services.

Generate Proxy List:

  • Navigate to the “Proxy List Generator” in your dashboard.
  • Set the format to username:password@hostname:port (a sample entry is sketched below).
  • Download the proxy list.

Move the Proxy File: Locate the downloaded file in your downloads directory and move it to your code directory.

rayobyte dashboard
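Assuming the username:password@hostname:port format selected above, each line of proxy-list.txt describes one credentialed endpoint (the credentials and hostnames below are placeholders), and the script turns a chosen line into the proxies dictionary that requests expects:

# proxy-list.txt (placeholder entries, one proxy per line):
# myuser:mypass@proxy1.example.com:8000
# myuser:mypass@proxy2.example.com:8000

proxy = 'myuser:mypass@proxy1.example.com:8000'
proxies = {
    'http': f'http://{proxy}',
    'https': f'http://{proxy}'
}
# requests.get(url, headers=HEADERS, proxies=proxies) then routes the request through that proxy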

Implement Proxy Rotation in Your Code

import os
import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
import pandas as pd
import random
import time
import logging
from tqdm import tqdm

load_dotenv()

# Set up logging
logging.basicConfig(filename='scraper.log', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S')

HEADERS = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}


def load_proxies(file_path):
    with open(file_path, 'r') as f:
        return [line.strip() for line in f if line.strip()]


PROXY_LIST = load_proxies('proxy-list.txt')


def get_random_proxy():
    return random.choice(PROXY_LIST)


def get_proxies(proxy):
    return {
        'http': f'http://{proxy}',
        'https': f'http://{proxy}'
    }


def scrape_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        proxy = get_random_proxy()
        proxies = get_proxies(proxy)
        try:
            response = requests.get(
                url, headers=HEADERS, proxies=proxies, timeout=30)
            if response.status_code == 200:
                return response
            else:
                logging.warning(
                    f"Attempt {attempt + 1} failed with status code {response.status_code} for URL: {url}")
        except requests.RequestException as e:
            logging.error(
                f"Attempt {attempt + 1} failed with error: {e} for URL: {url}")

        time.sleep(random.uniform(1, 3))

    logging.error(
        f"Failed to fetch data for {url} after {max_retries} attempts.")
    return None


def scrape_house_data(house_url):
    response = scrape_with_retry(house_url)
    if not response:
        return None

    soup = BeautifulSoup(response.content, 'html.parser')
    content = soup.find('div', class_='ds-data-view-list')

    if not content:
        logging.error(f"Failed to find content for {house_url}")
        return None

    year = content.find('span', class_='Text-c11n-8-100-2__sc-aiai24-0',
                        string=lambda text: text and "Built in" in text)
    year_built = year.text.strip().replace('Built in ', '') if year else "N/A"

    description_elem = content.find(
        'div', attrs={'data-testid': 'description'})
    description = description_elem.text.strip().replace(
        'Show more', '') if description_elem else "N/A"

    listing_details = content.find_all(
        'p', class_='Text-c11n-8-100-2__sc-aiai24-0', string=lambda text: text and "Listing updated" in text)
    listing_date = "N/A"
    if listing_details:
        date_details = listing_details[0].text.strip()
        date_part = date_details.split(' at ')[0]
        listing_date = date_part.replace('Listing updated: ', '').strip()

    containers = content.find_all('dt')
    days_on_zillow = containers[0].text.strip() if len(
        containers) > 0 else "N/A"
    views = containers[2].text.strip() if len(containers) > 2 else "N/A"
    total_save = containers[4].text.strip() if len(containers) > 4 else "N/A"

    realtor_elem = content.find(
        'p', attrs={'data-testid': 'attribution-LISTING_AGENT'})
    if realtor_elem:
        realtor_content = realtor_elem.text.strip().replace(',', '')
        if 'M:' in realtor_content:
            name, contact = realtor_content.split('M:')
        else:
            name_contact = realtor_content.rsplit(' ', 1)
            name = name_contact[0]
            contact = name_contact[1]

        realtor_name = name.strip()
        realtor_contact = contact.strip()

    else:
        realtor_name = "N/A"
        realtor_contact = "N/A"

    agency_elem = content.find(
        'p', attrs={'data-testid': 'attribution-BROKER'})
    agency_name = agency_elem.text.strip().replace(',', '') if agency_elem else "N/A"

    co_realtor_elem = content.find(
        'p', attrs={'data-testid': 'attribution-CO_LISTING_AGENT'})
    if co_realtor_elem:
        co_realtor_content = co_realtor_elem.text.strip().replace(',', '')
        if 'M:' in co_realtor_content:
            name, contact = co_realtor_content.split('M:')
        else:
            name_contact = co_realtor_content.rsplit(' ', 1)
            name = name_contact[0]
            contact = name_contact[1]

        co_realtor_name = name.strip()
        co_realtor_contact = contact.strip()

    else:
        co_realtor_name = "N/A"
        co_realtor_contact = "N/A"

    co_realtor_agency_elem = content.find(
        'p', attrs={'data-testid': 'attribution-CO_LISTING_AGENT_OFFICE'})
    co_realtor_agency_name = co_realtor_agency_elem.text.strip(
    ) if co_realtor_agency_elem else "N/A"

    return {
        'YEAR BUILT': year_built,
        'DESCRIPTION': description,
        'LISTING DATE': listing_date,
        'DAYS ON ZILLOW': days_on_zillow,
        'TOTAL VIEWS': views,
        'TOTAL SAVED': total_save,
        'REALTOR NAME': realtor_name,
        'REALTOR CONTACT NO': realtor_contact,
        'AGENCY': agency_name,
        'CO-REALTOR NAME': co_realtor_name,
        'CO-REALTOR CONTACT NO': co_realtor_contact,
        'CO-REALTOR AGENCY': co_realtor_agency_name
    }


def ensure_output_directory(directory):
    if not os.path.exists(directory):
        os.makedirs(directory)
        logging.info(f"Created output directory: {directory}")


def load_progress(output_file):
    if os.path.exists(output_file):
        return pd.read_csv(output_file)
    return pd.DataFrame()


def save_progress(df, output_file):
    df.to_csv(output_file, index=False)
    logging.info(f"Progress saved to {output_file}")


def main():
    input_file = './OUTPUT_1/house_details.csv'

    output_directory = 'OUTPUT_2'
    file_name = 'house_details_scraped.csv'
    output_file = os.path.join(output_directory, file_name)
    ensure_output_directory(output_directory)

    df = pd.read_csv(input_file)

    # Load existing progress
    result_df = load_progress(output_file)

    # Determine which URLs have already been scraped
    scraped_urls = set(result_df['HOUSE URL']
                      ) if 'HOUSE URL' in result_df.columns else set()

    # Scrape data for each house URL
    for _, row in tqdm(df.iterrows(), total=df.shape[0], desc="Scraping Progress"):
        house_url = row['HOUSE URL']

        # Skip if already scraped
        if house_url in scraped_urls:
            continue

        logging.info(f"Scraping data for {house_url}")
        data = scrape_house_data(house_url)

        if data:
            # Combine the original row data with the scraped data
            combined_data = {**row.to_dict(), **data}
            new_row = pd.DataFrame([combined_data])

            # Append the new row to the result DataFrame
            result_df = pd.concat([result_df, new_row], ignore_index=True)

            # Save progress after each successful scrape
            save_progress(result_df, output_file)

        # Add a random delay between requests (1 to 5 seconds)
        time.sleep(random.uniform(1, 5))

    logging.info(f"Scraping completed. Final results saved to {output_file}")
    print(
        f"Scraping completed. Check {output_file} for results and scraper.log for detailed logs.")


if __name__ == "__main__":
    main()
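Because the script saves progress after every successful scrape and reloads the output file on startup, it can be interrupted and re-run safely; already-scraped HOUSE URLs are skipped. A quick way to check how far it has progressed (using the output path defined in main above):

import pandas as pd

# Count how many houses have already been written to the resumable output file
done = pd.read_csv('./OUTPUT_2/house_details_scraped.csv')
print(f"{len(done)} houses scraped so far")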

Conclusion

In conclusion, this comprehensive guide on Zillow scraping with Python has equipped you with essential tools and techniques to effectively extract property listings and home prices. By following the outlined steps, you have learned how to navigate the complexities of web scraping, including overcoming anti-bot measures and utilizing proxies for seamless data retrieval.

Key takeaways from this tutorial include:

  • Understanding the Ethical Considerations: Emphasizing responsible scraping practices to respect website performance and legal guidelines.
  • Scraping Workflow: Dividing the scraping process into manageable parts for clarity and efficiency.
  • Technical Implementation: Utilizing Python libraries such as requests, BeautifulSoup, and json for data extraction.
  • Data Storage: Saving extracted information in CSV format for easy access and analysis.

As you implement these strategies, you will gain valuable insights into real estate trends and market dynamics, empowering you to make informed decisions based on the data collected. With the provided source code and detailed explanations, you are now well-prepared to adapt this project to your specific needs, whether that involves expanding your data collection or refining your analysis techniques. Embrace the power of data-driven insights as you explore the vast landscape of real estate information available through platforms like Zillow. Drop a comment below if you have any questions and Happy scraping!

Source code: zillow_properties_for_sale_scraper

Video: Extract data from Zillow properties for sale listing using Python
