{"id":1206,"date":"2024-10-18T17:03:51","date_gmt":"2024-10-18T17:03:51","guid":{"rendered":"https:\/\/rayobyte.com\/community\/?post_type=scraping_project&#038;p=1206"},"modified":"2024-10-24T16:59:13","modified_gmt":"2024-10-24T16:59:13","slug":"zillow-scraping-with-python-extract-property-listings-and-home-prices","status":"publish","type":"scraping_project","link":"https:\/\/rayobyte.com\/community\/scraping-project\/zillow-scraping-with-python-extract-property-listings-and-home-prices\/","title":{"rendered":"Zillow Scraping with Python: Extract Property Listings and Home Prices"},"content":{"rendered":"<h1><span style=\"font-weight: 400;\">Zillow Scraping with Python: Extract Property Listings and Home Prices<\/span><\/h1>\n<p><iframe loading=\"lazy\" title=\"Extract data from Zillow properties for sale listing using Python\" width=\"640\" height=\"360\" src=\"https:\/\/www.youtube.com\/embed\/CyHNw0xwp8E?feature=oembed&#038;enablejsapi=1&#038;origin=https:\/\/rayobyte.com\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe><\/p>\n<p><span style=\"font-weight: 400;\">Source code: <\/span><a href=\"https:\/\/github.com\/ainacodes\/zillow_properties_for_sale_scraper\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">zillow_properties_for_sale_scraper<\/span><\/a><span style=\"font-weight: 400;\">\u00a0<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Table of Contents<\/span><\/h2>\n<p><a href=\"#introduction\">Introduction<\/a><br \/>\n<a href=\"#ethical-consideration\">Ethical Consideration<\/a><br \/>\n<a href=\"#scraping-workflow\">Scraping Workflow<\/a><br \/>\n<a href=\"#prerequisites\">Prerequisites<\/a><br \/>\n<a href=\"#project-setup\">Project Setup<\/a><br \/>\n<a href=\"#part-1\">[PART 1] Scraping Zillow Data from the search page<\/a><br \/>\n<a 
href=\"#complete-code-first-page\">Complete code for the first page<\/a><br \/>\n<a href=\"#next-page\">Get the information from the next page<\/a><br \/>\n<a href=\"#complete-code-all-pages\">Complete code for all pages<\/a><br \/>\n<a href=\"#part-2\">[PART 2 ] Scrape the other information from the Properties page<\/a><br \/>\n<a href=\"#\">Complete code for the additional data<\/a><br \/>\n<a href=\"#\">Complete code with the additional data with Proxy Rotation<\/a><br \/>\n<a href=\"#conclusions\">Conclusion<\/a><\/p>\n<h2 id=\"introduction\"><span style=\"font-weight: 400;\">Introduction<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Zillow is a go-to platform for real estate data, featuring millions of property listings with detailed information on prices, locations, and home features. In this tutorial, we\u2019ll guide you through the process of Zillow scraping using Python.\u00a0 You will learn how to extract essential property details such as home prices and geographic data, which will empower you to track market trends, analyze property values, and compare listings across various regions. This guide includes source code and techniques to effectively implement Zillow scraping.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In this tutorial, we will focus on collecting data for houses listed for sale in Nebraska. 
Our starting URL will be:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Base URL: <\/span><a href=\"https:\/\/www.zillow.com\/ne\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">https:\/\/www.zillow.com\/ne<\/span><\/a><span style=\"font-weight: 400;\">\u00a0<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">The information we want to scrape is:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">House URL<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Images<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Price<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Address<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Number of bedroom(s)<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Number of bathroom(s)<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">House Size<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Lot Size<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">House Type<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Year Built<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Description<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Listing Date<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Days on Zillow<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Total Views<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Total Saved<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Realtor Name<\/span><\/li>\n<li style=\"font-weight: 400;\"><span 
style=\"font-weight: 400;\">Realtor Contact Number<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Agency<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Co-realtor Name<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Co-realtor contact number<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Co-realtor agency<\/span><\/li>\n<\/ul>\n<h2 id=\"ethical-consideration\"><span style=\"font-weight: 400;\">Ethical Consideration<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Before we dive into the technical aspects of scraping Zillow, it&#8217;s important to emphasize that this tutorial is intended for <\/span><b>educational purposes<\/b><span style=\"font-weight: 400;\"> only. When interacting with public servers, it&#8217;s vital to maintain a responsible approach. Here are some essential guidelines to keep in mind:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Respect Website Performance: Avoid scraping at a speed that could negatively impact the website\u2019s performance or availability.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Public Data Only: Ensure that you only scrape data that is publicly accessible. 
Respect any restrictions set by the website.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">No Redistribution of Data: Refrain from redistributing entire public datasets, as this may violate legal regulations in certain jurisdictions.<\/span><\/li>\n<\/ul>\n<h2 id=\"scraping-workflow\"><span style=\"font-weight: 400;\">Scraping Workflow<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">The Zillow scraper can be effectively divided into <strong>two parts<\/strong>, each focusing on different aspects of data extraction.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The first part involves extracting essential information from the Zillow search results page which consists of this information.<\/span><\/p>\n<p><span style=\"font-weight: 400;\"><code>HOUSE URLs<\/code>, <code>PHOTO URLs<\/code>, <code>PRICE<\/code>, <code>FULL ADDRESS<\/code>, <code>STREET<\/code>, <code>CITY<\/code>, <code>STATE<\/code>, <code>ZIP CODE<\/code>, <code>NUMBER OF BEDROOMS<\/code>, <code>NUMBER OF BATHROOMS<\/code>, <code>HOUSE SIZE<\/code>, <code>LOT SIZE<\/code> and <code>HOUSE TYPE<\/code><\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1207 size-full\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/01-searchpage_info.png\" alt=\"search page\" width=\"1301\" height=\"1032\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/01-searchpage_info.png 1301w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/01-searchpage_info-300x238.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/01-searchpage_info-1024x812.png 1024w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/01-searchpage_info-768x609.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/01-searchpage_info-624x495.png 624w\" sizes=\"auto, (max-width: 1301px) 100vw, 1301px\" \/><\/p>\n<p><span style=\"font-weight: 
400;\">It is important to note that while the search page provides a wealth of information, it does not display <code>LOT SIZE<\/code> and <code>HOUSE TYPE<\/code> directly. However, these values are accessible through the backend which I\u2019ll show you later.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The second part is to scrape the rest of the information from the particular <code>HOUSE URLs<\/code> page which includes:<\/span><\/p>\n<p><span style=\"font-weight: 400;\"><code>YEAR BUILT<\/code>, <code>DESCRIPTION<\/code>, <code>LISTING DATE<\/code>, <code>DAYS ON ZILLOW<\/code>, <code>TOTAL VIEWS<\/code>,\u00a0 <code>TOTAL SAVED<\/code>, <code>REALTOR NAME<\/code>, <code>REALTOR CONTACT NO<\/code>, <code>AGENCY<\/code>, <code>CO-REALTOR NAME<\/code>, <code>CO-REALTOR CONTACT NO<\/code> and <code>CO-REALTOR AGENCY<\/code><\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1208 size-full\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/02-housepage_details.png\" alt=\"House page\" width=\"1301\" height=\"1032\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/02-housepage_details.png 1301w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/02-housepage_details-300x238.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/02-housepage_details-1024x812.png 1024w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/02-housepage_details-768x609.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/02-housepage_details-624x495.png 624w\" sizes=\"auto, (max-width: 1301px) 100vw, 1301px\" \/><\/p>\n<h2 id=\"prerequisites\"><span style=\"font-weight: 400;\">Prerequisites<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Before starting this project, ensure you have the following:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\"><strong>Python 
Installed<\/strong>: Make sure Python is installed on your machine.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\"><strong>Proxy Usage<\/strong>: It is highly recommended to use a proxy for this project to avoid detection and potential blocking. For this tutorial, we will use a residential proxy from <\/span><a href=\"https:\/\/rayobyte.com\/products\/residential-proxies\/\"><span style=\"font-weight: 400;\">Rayobyte<\/span><\/a><span style=\"font-weight: 400;\">. You can sign up for a free trial that offers 50MB of usage without requiring a credit card.<\/span><\/li>\n<\/ul>\n<h2 id=\"project-setup\"><span style=\"font-weight: 400;\">Project Setup<\/span><\/h2>\n<ol>\n<li><span style=\"font-weight: 400;\">Create a new folder in your desired directory to house your project files.<\/span><\/li>\n<li>Open your terminal in the directory you just created and run the following command to install the necessary libraries:<\/li>\n<\/ol>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">pip install requests beautifulsoup4<\/pre>\n<p><span style=\"font-weight: 400;\">3. If you are using a proxy, I suggest you install the <\/span><a href=\"https:\/\/pypi.org\/project\/python-dotenv\/\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">python-dotenv<\/span><\/a><span style=\"font-weight: 400;\"> package as well, to store your credentials in a <code>.env<\/code> file.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">pip install python-dotenv<\/pre>\n<p><span style=\"font-weight: 400;\">4. Open your preferred code editor (for example, Visual Studio Code) and create a new file with the extension <code>.ipynb<\/code>. 
This will create a new Jupyter notebook within VS Code.<\/span><\/p>\n<h2 id=\"part-1\"><span style=\"font-weight: 400;\">[PART 1] Scraping Zillow Data from the search page<\/span><\/h2>\n<ul>\n<li>House URL, Images, Price, Address, Number of bedroom(s), Number of bathroom(s), House Size, Lot Size and House Type<\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">In this section, we will implement the code to scrape property data from Zillow. We will cover everything from importing libraries to saving the extracted information in a CSV file.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">First, we need to import the libraries that will help us with HTTP requests and HTML parsing.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import requests\r\nfrom bs4 import BeautifulSoup\r\nimport json<\/pre>\n<p><span style=\"font-weight: 400;\">Setting headers helps disguise our request as if it\u2019s coming from a real browser, which can help avoid detection.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">headers = {\r\n\u00a0 \u00a0 \"User-Agent\": 'Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/119.0.0.0 Safari\/537.36',\r\n\u00a0 \u00a0 \"Accept\": \"text\/html,application\/xhtml+xml,application\/xml;q=0.9,image\/webp,*\/*;q=0.8\",\r\n\u00a0 \u00a0 \"Accept-Language\": \"en-US,en;q=0.5\",\r\n\u00a0 \u00a0 \"Accept-Encoding\": \"gzip, deflate, br\",\r\n\u00a0 \u00a0 \"DNT\": \"1\",\r\n\u00a0 \u00a0 \"Connection\": \"keep-alive\",\r\n\u00a0 \u00a0 \"Upgrade-Insecure-Requests\": \"1\",\r\n}<\/pre>\n<p><span style=\"font-weight: 400;\">If you have a proxy, include it in your requests to avoid potential blocks.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">proxies = {\r\n\u00a0 \u00a0 'http': 'http:\/\/username:password@host:port',\r\n\u00a0 \u00a0 'https': 'http:\/\/username:password@host:port'\r\n}<\/pre>\n<p><span 
style=\"font-weight: 400;\">Make sure to replace username, password, host, and port with your actual proxy credentials.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Or you can create a <code>.env<\/code> file to store your proxy credentials and load your proxies like this:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import os\r\nfrom dotenv import load_dotenv\r\n\r\nload_dotenv()\r\n\r\nproxy = os.getenv(\"PROXY\")\r\nproxies = {\r\n\u00a0 \u00a0 'http': f'http:\/\/{proxy}',\r\n\u00a0 \u00a0 'https': f'http:\/\/{proxy}'\r\n}<\/pre>\n<p><span style=\"font-weight: 400;\">Define the URL for the state you want to scrape\u2014in this case, Nebraska.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">url = \"https:\/\/www.zillow.com\/ne\"<\/pre>\n<p><span style=\"font-weight: 400;\">Send a GET request to the server using the headers and proxies defined earlier.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">response = requests.get(url, headers=headers, proxies=proxies)\u00a0 # Use proxies if available\r\n# If you don't have a proxy:\r\n# response = requests.get(url, headers=headers)<\/pre>\n<p><span style=\"font-weight: 400;\">Use BeautifulSoup to parse the HTML content of the page.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">soup = BeautifulSoup(response.content, 'html.parser')<\/pre>\n<h3><span style=\"font-weight: 400;\">Extract House URLs from Listing Cards<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">The first thing that we want from the first landing page is to extract all the House URLs. 
Normally these URLs are available inside the &#8220;listing cards&#8221;.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1210 size-full\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/03-listing_card.png\" alt=\"Listing card\" width=\"1920\" height=\"1020\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/03-listing_card.png 1920w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/03-listing_card-300x159.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/03-listing_card-1024x544.png 1024w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/03-listing_card-768x408.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/03-listing_card-1536x816.png 1536w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/03-listing_card-624x332.png 624w\" sizes=\"auto, (max-width: 1920px) 100vw, 1920px\" \/><\/p>\n<h3><span style=\"font-weight: 400;\">Inspecting the &#8220;Listing Card&#8221;<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">To inspect the element, &#8220;<strong>right-click<\/strong>&#8221; anywhere and click on &#8220;<strong>Inspect<\/strong>&#8221; or simply press <strong>F12<\/strong>. 
Click on this arrow icon and start hovering on the element that we want.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1211\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/04-hover_arrow.png\" alt=\"Hover arrow icon\" width=\"367\" height=\"271\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/04-hover_arrow.png 367w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/04-hover_arrow-300x222.png 300w\" sizes=\"auto, (max-width: 367px) 100vw, 367px\" \/> <img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1212 size-full\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/05-listing_card_element.png\" alt=\"listing card element\" width=\"731\" height=\"689\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/05-listing_card_element.png 731w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/05-listing_card_element-300x283.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/05-listing_card_element-624x588.png 624w\" sizes=\"auto, (max-width: 731px) 100vw, 731px\" \/><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">listing_card = soup.find_all('li', class_='ListItem-c11n-8-105-0__sc-13rwu5a-0')\r\nprint(len(listing_card))<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1213\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/06-card_len.png\" alt=\"listing len\" width=\"744\" height=\"118\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/06-card_len.png 744w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/06-card_len-300x48.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/06-card_len-624x99.png 624w\" sizes=\"auto, (max-width: 744px) 100vw, 744px\" 
\/><\/p>\n<p><span style=\"font-weight: 400;\">As we can see, there are 42 listings on this first page.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Now, let&#8217;s try getting the <strong>url<\/strong>. If we expand the <code>li<\/code> tag, we will notice there is an <code>a<\/code> tag, and the <strong>url<\/strong> is in its <code>href<\/code> attribute:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1214\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/07-house_url.png\" alt=\"House url html tag\" width=\"684\" height=\"144\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/07-house_url.png 684w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/07-house_url-300x63.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/07-house_url-624x131.png 624w\" sizes=\"auto, (max-width: 684px) 100vw, 684px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">To get this value, let\u2019s test by extracting inside the first listing. 
Therefore, we need to specify that we want the information from the first card only.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">card = listing_card[0]<\/pre>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">house_url = card.find('a').get('href')\r\nprint('URL:', house_url)<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1217\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/08-url_result.png\" alt=\"house url result\" width=\"789\" height=\"225\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/08-url_result.png 789w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/08-url_result-300x86.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/08-url_result-768x219.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/08-url_result-624x178.png 624w\" sizes=\"auto, (max-width: 789px) 100vw, 789px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">This works fine so far. However, as you may or may not know, Zillow has strong anti-bot detection mechanisms. With this method, you\u2019ll only get <strong>10 URLs instead of 42<\/strong>, the total number of listings that appear on the first page.\u00a0<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Overcoming Anti-Bot Detection by Extracting the Data from JSON<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">To overcome this issue, I found another approach that uses the \u201cJavaScript-rendered\u201d data returned by the web page. 
If we scroll down the \u201cinspect\u201d page, we will find a script tag with the <strong>id=&#8221;__NEXT_DATA__&#8221;<\/strong><\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">content = soup.find('script', id='__NEXT_DATA__')<\/pre>\n<p><span style=\"font-weight: 400;\">Convert the content to json format.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import json<\/pre>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">json_content = content.string\r\ndata = json.loads(json_content)<\/pre>\n<p><span style=\"font-weight: 400;\">Save this JSON data for easier inspection later:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">with open('output.json', 'w') as json_file:\r\n\u00a0 \u00a0 json.dump(data, json_file, indent=4)<\/pre>\n<p><span style=\"font-weight: 400;\">After running this code, you\u2019ll get the <code>output.json<\/code> file inside your folder.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Open the file to locate the URL. 
I\u2019m using <code>ctrl+f<\/code> to find the URL location inside my VScode.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-1218\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/09-output_1_json-1024x552.png\" alt=\"json output\" width=\"640\" height=\"345\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/09-output_1_json-1024x552.png 1024w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/09-output_1_json-300x162.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/09-output_1_json-768x414.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/09-output_1_json-624x336.png 624w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/09-output_1_json.png 1500w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Notice here the URL is inside the <code>\"detailUrl\"<\/code>. 
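<\/span><\/p>\n<p><span style=\"font-weight: 400;\">If you would rather locate that path programmatically than scroll through the file by hand, a small recursive helper can print where a given key sits in the nested structure. This is only a convenience sketch using the standard library; <code>find_key_paths<\/code> is our own helper name, not part of any package.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">def find_key_paths(node, target, path=''):\r\n\u00a0 \u00a0 # Walk nested dicts and lists, yielding the path to every occurrence of 'target'\r\n\u00a0 \u00a0 if isinstance(node, dict):\r\n\u00a0 \u00a0 \u00a0 \u00a0 for key, value in node.items():\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 current = f'{path}[{key!r}]'\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 if key == target:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 yield current\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 yield from find_key_paths(value, target, current)\r\n\u00a0 \u00a0 elif isinstance(node, list):\r\n\u00a0 \u00a0 \u00a0 \u00a0 for index, value in enumerate(node):\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 yield from find_key_paths(value, target, f'{path}[{index}]')\r\n\r\n# Usage, with 'data' from json.loads() above:\r\n# for found in find_key_paths(data, 'detailUrl'):\r\n#\u00a0 \u00a0 print(found)<\/pre>\n<p><span style=\"font-weight: 400;\">Each printed path can be pasted directly as an indexing expression on <code>data<\/code>.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">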
Apart from that, it returns other useful information as well.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To extract the values from this JSON structure:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">house_details = data['props']['pageProps']['searchPageState']['cat1']['searchResults']['listResults']<\/pre>\n<p><span style=\"font-weight: 400;\">Get the first listing:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">detail = house_details[0]<\/pre>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">house_url = detail['detailUrl']\r\nhouse_url<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-1221\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/10-house_json_url-1024x162.png\" alt=\"\" width=\"640\" height=\"101\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/10-house_json_url-1024x162.png 1024w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/10-house_json_url-300x47.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/10-house_json_url-768x121.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/10-house_json_url-624x99.png 624w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/10-house_json_url.png 1317w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">We get the same value as before.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To get all the URLs from the first page:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">house_urls = [detail['detailUrl'] for detail in house_details]<\/pre>\n<p><span style=\"font-weight: 400;\">By using this method, we are able to get all the URLs.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1220\" 
src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/11-house_url_all.png\" alt=\"all house url\" width=\"817\" height=\"699\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/11-house_url_all.png 817w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/11-house_url_all-300x257.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/11-house_url_all-768x657.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/11-house_url_all-624x534.png 624w\" sizes=\"auto, (max-width: 817px) 100vw, 817px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">As we inspect our json file, we can see other information that we\u2019re interested in as well. So let\u2019s get these values from here.\u00a0<\/span><\/p>\n<p><b>Image URLs<\/b><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">photo_urls = [photo['url'] for photo in detail['carouselPhotos']]<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-1222\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/12-photo_urls-1024x630.png\" alt=\"Photos URLs\" width=\"640\" height=\"394\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/12-photo_urls-1024x630.png 1024w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/12-photo_urls-300x185.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/12-photo_urls-768x472.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/12-photo_urls-624x384.png 624w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/12-photo_urls.png 1216w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">price = detail['price']\r\nfull_address = detail['address']\r\naddress_street = detail['addressStreet']\r\ncity = detail['addressCity']\r\nstate 
= detail['addressState']\r\nzipcode = detail['addressZipcode']\r\nhome_info = detail['hdpData']['homeInfo']\r\nbedrooms = home_info['bedrooms']\r\nbathrooms = home_info['bathrooms']\r\nhouse_size = home_info['livingArea']\r\nlot_size = home_info['lotAreaValue']\r\nhouse_type = home_info['homeType']<\/pre>\n<h3><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1223\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/13-part_1_data.png\" alt=\"\" width=\"708\" height=\"344\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/13-part_1_data.png 708w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/13-part_1_data-300x146.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/13-part_1_data-624x303.png 624w\" sizes=\"auto, (max-width: 708px) 100vw, 708px\" \/><\/h3>\n<h3><span style=\"font-weight: 400;\">Save all the information in CSV file<\/span><\/h3>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import csv\r\n\r\n# Open a new CSV file for writing\r\nwith open('house_details.csv', 'w', newline='', encoding='utf-8') as csvfile:\r\n\u00a0 \u00a0 # Create a CSV writer object\r\n\u00a0 \u00a0 csvwriter = csv.writer(csvfile)\r\n\u00a0 \u00a0\r\n\u00a0 \u00a0 # Write the header row\r\n\u00a0 \u00a0 csvwriter.writerow(['HOUSE URL', 'PHOTO URLs', 'PRICE', 'FULL ADDRESS', 'STREET', 'CITY', 'STATE', 'ZIP CODE',\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'NUMBER OF BEDROOMS', 'NUMBER OF BATHROOM', 'HOUSE SIZE', 'LOT SIZE', 'HOUSE TYPE'])\r\n\u00a0 \u00a0\r\n\u00a0 \u00a0 # Iterate through the house details and write each row\r\n\u00a0 \u00a0 for detail in house_details:\r\n\u00a0 \u00a0 \u00a0 \u00a0 house_url = detail['detailUrl']\r\n\u00a0 \u00a0 \u00a0 \u00a0 photo_urls = ','.join([photo['url'] for photo in detail['carouselPhotos']])\r\n\u00a0 \u00a0 \u00a0 \u00a0 price = 
detail['price']\r\n\u00a0 \u00a0 \u00a0 \u00a0 full_address = detail['address']\r\n\u00a0 \u00a0 \u00a0 \u00a0 address_street = detail['addressStreet']\r\n\u00a0 \u00a0 \u00a0 \u00a0 city = detail['addressCity']\r\n\u00a0 \u00a0 \u00a0 \u00a0 state = detail['addressState']\r\n\u00a0 \u00a0 \u00a0 \u00a0 zipcode = detail['addressZipcode']\r\n\u00a0 \u00a0 \u00a0 \u00a0 home_info = detail['hdpData']['homeInfo']\r\n\u00a0 \u00a0 \u00a0 \u00a0 bedrooms = home_info['bedrooms']\r\n\u00a0 \u00a0 \u00a0 \u00a0 bathrooms = home_info['bathrooms']\r\n\u00a0 \u00a0 \u00a0 \u00a0 house_size = home_info['livingArea']\r\n\u00a0 \u00a0 \u00a0 \u00a0 lot_size = home_info['lotAreaValue']\r\n\u00a0 \u00a0 \u00a0 \u00a0 lot_unit = home_info['lotAreaUnit']\r\n\u00a0 \u00a0 \u00a0 \u00a0 house_type = home_info['homeType']\r\n\u00a0 \u00a0 \u00a0 \u00a0\r\n\u00a0 \u00a0 \u00a0 \u00a0 # Write the row to the CSV file\r\n\u00a0 \u00a0 \u00a0 \u00a0 csvwriter.writerow([house_url, photo_urls, price, full_address, address_street, city, state, zipcode, bedrooms, bathrooms, house_size, f'{lot_size} {lot_unit}', house_type])\r\n\r\nprint(\"Data has been saved to house_details.csv\")<\/pre>\n<p><span style=\"font-weight: 400;\">This is all the output from the first page, 41 rows in total.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-1226\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/14-csv_output-1024x671.png\" alt=\"\" width=\"640\" height=\"419\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/14-csv_output-1024x671.png 1024w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/14-csv_output-300x197.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/14-csv_output-768x503.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/14-csv_output-624x409.png 624w, 
https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/14-csv_output.png 1135w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/p>\n<h2 id=\"complete-code-first-page\"><span style=\"font-weight: 400;\">Complete code for the first page<\/span><\/h2>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import os\r\nimport requests\r\nfrom bs4 import BeautifulSoup\r\nimport json\r\nimport csv\r\nfrom dotenv import load_dotenv\r\n\r\nload_dotenv()\r\n\r\n# Define headers for the HTTP request\r\nHEADERS = {\r\n\u00a0 \u00a0 \"User-Agent\": 'Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/129.0.0.0 Safari\/537.36',\r\n\u00a0 \u00a0 \"Accept\": \"text\/html,application\/xhtml+xml,application\/xml;q=0.9,image\/webp,*\/*;q=0.8\",\r\n\u00a0 \u00a0 \"Accept-Language\": \"en-US,en;q=0.5\",\r\n\u00a0 \u00a0 \"Accept-Encoding\": \"gzip, deflate, br\",\r\n\u00a0 \u00a0 \"DNT\": \"1\",\r\n\u00a0 \u00a0 \"Connection\": \"keep-alive\",\r\n\u00a0 \u00a0 \"Upgrade-Insecure-Requests\": \"1\",\r\n}\r\n\r\n# Define proxy settings (if needed)\r\nproxy = os.getenv(\"PROXY\")\r\n\r\nPROXIES = {\r\n\u00a0 \u00a0 'http': f'http:\/\/{proxy}',\r\n\u00a0 \u00a0 'https': f'http:\/\/{proxy}'\r\n}\r\n\r\n\r\ndef fetch_data(url):\r\n\u00a0 \u00a0 try:\r\n\u00a0 \u00a0 \u00a0 \u00a0 response = requests.get(url, headers=HEADERS, proxies=PROXIES)\r\n\u00a0 \u00a0 \u00a0 \u00a0 response.raise_for_status()\r\n\u00a0 \u00a0 \u00a0 \u00a0 return response.content\r\n\u00a0 \u00a0 except requests.RequestException as e:\r\n\u00a0 \u00a0 \u00a0 \u00a0 print(f\"Error fetching data: {e}\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 return None\r\n\r\n\r\ndef parse_data(content):\r\n\u00a0 \u00a0 soup = BeautifulSoup(content, 'html.parser')\r\n\u00a0 \u00a0 script_content = soup.find('script', id='__NEXT_DATA__')\r\n\r\n\u00a0 \u00a0 if script_content:\r\n\u00a0 \u00a0 \u00a0 \u00a0 json_content = script_content.string\r\n\u00a0 \u00a0 \u00a0 \u00a0 
return json.loads(json_content)\r\n\u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 print(\"Could not find the required script tag.\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 return None\r\n\r\n\r\ndef save_to_csv(house_details, output_file):\r\n\u00a0 \u00a0 with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:\r\n\u00a0 \u00a0 \u00a0 \u00a0 csvwriter = csv.writer(csvfile)\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 csvwriter.writerow([\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'HOUSE URL', 'PHOTO URLs', 'PRICE', 'FULL ADDRESS',\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'STREET', 'CITY', 'STATE', 'ZIP CODE',\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'NUMBER OF BEDROOMS', 'NUMBER OF BATHROOMS',\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'HOUSE SIZE', 'LOT SIZE', 'HOUSE TYPE'\r\n\u00a0 \u00a0 \u00a0 \u00a0 ])\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 for detail in house_details:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 home_info = detail['hdpData']['homeInfo']\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 photo_urls = ','.join([photo['url']\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 for photo in detail['carouselPhotos']])\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 # Concatenate lot area value and unit\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 lot_size = f\"{home_info.get('lotAreaValue')} {home_info.get('lotAreaUnit')}\"\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 csvwriter.writerow([\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail['detailUrl'],\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 photo_urls,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail['price'],\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail['address'],\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail['addressStreet'],\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail['addressCity'],\r\n\u00a0 \u00a0 
\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail['addressState'],\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail['addressZipcode'],\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 home_info.get('bedrooms'),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 home_info.get('bathrooms'),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 home_info.get('livingArea'),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 lot_size,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 home_info.get('homeType', '').replace('_', ' ')\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 ])\r\n\r\n\r\ndef main():\r\n\u00a0 \u00a0 URL = \"https:\/\/www.zillow.com\/ne\"\r\n\u00a0 \u00a0 content = fetch_data(URL)\r\n\r\n\u00a0 \u00a0 output_directory = 'OUTPUT_1'\r\n\u00a0 \u00a0 os.makedirs(output_directory, exist_ok=True)\r\n\u00a0 \u00a0 file_name = 'house_details_first_page.csv'\r\n\u00a0 \u00a0 output_file = os.path.join(output_directory, file_name)\r\n\r\n\u00a0 \u00a0 if content:\r\n\u00a0 \u00a0 \u00a0 \u00a0 data = parse_data(content)\r\n\u00a0 \u00a0 \u00a0 \u00a0 if data:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 house_details = data['props']['pageProps']['searchPageState']['cat1']['searchResults']['listResults']\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 save_to_csv(house_details, output_file)\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 print(f\"Data has been saved to {output_file}\")\r\n\r\n\r\nif __name__ == \"__main__\":\r\n\u00a0 \u00a0 main()<\/pre>\n<p><span style=\"font-weight: 400;\">After running this code, it will create a new folder named <code>OUTPUT_1<\/code>, and you\u2019ll find the file <code>house_details_first_page.csv<\/code> inside it.<\/span><\/p>\n<h2 id=\"next-page\"><span style=\"font-weight: 400;\">Get the information from the next page<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">First, take a look at the URLs for the pages we want to scrape:<\/span><\/p>\n<ul>\n<li 
style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">First Page: https:\/\/www.zillow.com\/ne\u00a0<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Second Page: https:\/\/www.zillow.com\/ne\/2_p<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Third Page: https:\/\/www.zillow.com\/ne\/3_p\u00a0<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">Notice how the page number increments by 1 with each subsequent page.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To automate the scraping process, we will use a while loop that iterates through the pages. Here\u2019s how we can set it up:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">base_url = \"https:\/\/www.zillow.com\/ne\"\r\npage = 1\r\nmax_pages = 10\u00a0 # Adjust this to scrape more pages, or set to None for all pages\r\n\r\nwhile max_pages is None or page &lt;= max_pages:\r\n\u00a0 \u00a0 if page == 1:\r\n\u00a0 \u00a0 \u00a0 \u00a0 url = base_url\r\n\u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 url = f\"{base_url}\/{page}_p\"<\/pre>\n<h2 id=\"complete-code-all-pages\"><span style=\"font-weight: 400;\">Complete code for all pages<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Below is the complete code that scrapes all specified pages. We will also use <\/span><a href=\"https:\/\/github.com\/tqdm\/tqdm\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">tqdm <\/span><\/a><span style=\"font-weight: 400;\">to monitor our scraping progress. To install tqdm, run:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">pip install tqdm<\/pre>\n<p><span style=\"font-weight: 400;\">Additionally, we&#8217;ll implement logging to capture any errors during execution. 
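<\/span><\/p>\n<p><span style=\"font-weight: 400;\">As a quick illustration of how tqdm and logging work together, here is a minimal sketch; the loop below is a placeholder and not part of the scraper itself:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import logging\r\nimport time\r\nfrom tqdm import tqdm\r\n\r\n# Log messages go to scraper.log; tqdm draws the progress bar on the console\r\nlogging.basicConfig(filename='scraper.log', level=logging.INFO,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 format='%(asctime)s - %(levelname)s - %(message)s')\r\n\r\nfor page in tqdm(range(1, 6), desc=\"Scraping pages\", unit=\"page\"):\r\n\u00a0 \u00a0 logging.info(f\"Scraping page {page}\")\r\n\u00a0 \u00a0 time.sleep(0.1)\u00a0 # placeholder for the real page request<\/pre>\n<p><span style=\"font-weight: 400;\">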
A log file named <code>scraper.log<\/code><\/span><span style=\"font-weight: 400;\">\u00a0will be created to store these logs.<\/span><\/p>\n<p><b>Important Notes<\/b><\/p>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">The current setup limits scraping to 5 pages. <strong>To modify this scraper to extract data from all available pages, simply change <code>max_pages<\/code> on <code>line 101<\/code> to None.<\/strong><\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Don&#8217;t forget to update your proxy credentials as necessary.<\/span><\/li>\n<\/ul>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import os\r\nimport requests\r\nfrom bs4 import BeautifulSoup\r\nimport json\r\nimport csv\r\nimport time\r\nimport logging\r\nfrom tqdm import tqdm\r\nfrom dotenv import load_dotenv\r\n\r\nload_dotenv()\r\n\r\n# Set up logging\r\nlogging.basicConfig(filename='scraper.log', level=logging.INFO,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 format='%(asctime)s - %(levelname)s - %(message)s')\r\n\r\n# Define headers for the HTTP request\r\nHEADERS = {\r\n\u00a0 \u00a0 \"User-Agent\": 'Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/129.0.0.0 Safari\/537.36',\r\n\u00a0 \u00a0 \"Accept\": \"text\/html,application\/xhtml+xml,application\/xml;q=0.9,image\/webp,*\/*;q=0.8\",\r\n\u00a0 \u00a0 \"Accept-Language\": \"en-US,en;q=0.5\",\r\n\u00a0 \u00a0 \"Accept-Encoding\": \"gzip, deflate, br\",\r\n\u00a0 \u00a0 \"DNT\": \"1\",\r\n\u00a0 \u00a0 \"Connection\": \"keep-alive\",\r\n\u00a0 \u00a0 \"Upgrade-Insecure-Requests\": \"1\",\r\n}\r\n\r\n# Define proxy settings (if needed)\r\nproxy = os.getenv(\"PROXY\")\r\n\r\nPROXIES = {\r\n\u00a0 \u00a0 'http': f'http:\/\/{proxy}',\r\n\u00a0 \u00a0 'https': f'http:\/\/{proxy}'\r\n}\r\n\r\n\r\ndef fetch_data(url):\r\n\u00a0 \u00a0 try:\r\n\u00a0 \u00a0 \u00a0 \u00a0 response = requests.get(url, 
headers=HEADERS, proxies=PROXIES)\r\n\u00a0 \u00a0 \u00a0 \u00a0 response.raise_for_status()\r\n\u00a0 \u00a0 \u00a0 \u00a0 return response.content\r\n\u00a0 \u00a0 except requests.RequestException as e:\r\n\u00a0 \u00a0 \u00a0 \u00a0 logging.error(f\"Error fetching data: {e}\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 return None\r\n\r\n\r\ndef parse_data(content):\r\n\u00a0 \u00a0 try:\r\n\u00a0 \u00a0 \u00a0 \u00a0 soup = BeautifulSoup(content, 'html.parser')\r\n\u00a0 \u00a0 \u00a0 \u00a0 script_content = soup.find('script', id='__NEXT_DATA__')\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 if script_content:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 json_content = script_content.string\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 return json.loads(json_content)\r\n\u00a0 \u00a0 \u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.error(\"Could not find the required script tag.\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 return None\r\n\u00a0 \u00a0 except json.JSONDecodeError as e:\r\n\u00a0 \u00a0 \u00a0 \u00a0 logging.error(f\"Error parsing JSON: {e}\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 return None\r\n\r\n\r\ndef save_to_csv(house_details, output_file, mode='a'):\r\n\u00a0 \u00a0 with open(output_file, mode, newline='', encoding='utf-8') as csvfile:\r\n\u00a0 \u00a0 \u00a0 \u00a0 csvwriter = csv.writer(csvfile)\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 if mode == 'w':\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 csvwriter.writerow([\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'HOUSE URL', 'PHOTO URLs', 'PRICE', 'FULL ADDRESS',\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'STREET', 'CITY', 'STATE', 'ZIP CODE',\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'NUMBER OF BEDROOMS', 'NUMBER OF BATHROOMS',\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'HOUSE SIZE', 'LOT SIZE', 'HOUSE TYPE'\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 ])\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 for detail in tqdm(house_details, 
desc=\"Saving house details\", unit=\"house\"):\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 try:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 home_info = detail.get('hdpData', {}).get('homeInfo', {})\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 photo_urls = ','.join([photo.get('url', '')\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 for photo in detail.get('carouselPhotos', [])])\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 # Concatenate lot area value and unit\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 lot_size = f\"{home_info.get('lotAreaValue', '')} {home_info.get('lotAreaUnit', '')}\"\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 csvwriter.writerow([\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail.get('detailUrl', ''),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 photo_urls,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail.get('price', ''),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail.get('address', ''),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail.get('addressStreet', ''),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail.get('addressCity', ''),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail.get('addressState', ''),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail.get('addressZipcode', ''),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 home_info.get('bedrooms', ''),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 home_info.get('bathrooms', ''),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 home_info.get('livingArea', ''),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 
\u00a0 \u00a0 \u00a0 \u00a0 lot_size,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 home_info.get('homeType', '').replace('_', ' ')\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 ])\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 except Exception as e:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.error(f\"Error processing house detail: {e}\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.error(f\"Problematic detail: {detail}\")\r\n\r\n\r\ndef main():\r\n\u00a0 \u00a0 base_url = \"https:\/\/www.zillow.com\/ne\"\r\n\u00a0 \u00a0 page = 1\r\n\u00a0 \u00a0 max_pages = 5\u00a0 # Set this to the number of pages you want to scrape, or None for all pages\r\n\r\n\u00a0 \u00a0 output_directory = 'OUTPUT_1'\r\n\u00a0 \u00a0 os.makedirs(output_directory, exist_ok=True)\r\n\u00a0 \u00a0 file_name = f'house_details-1-{max_pages}.csv'\r\n\u00a0 \u00a0 output_file = os.path.join(output_directory, file_name)\r\n\r\n\u00a0 \u00a0 with tqdm(total=max_pages, desc=\"Scraping pages\", unit=\"page\") as pbar:\r\n\u00a0 \u00a0 \u00a0 \u00a0 while max_pages is None or page &lt;= max_pages:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 if page == 1:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 url = base_url\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 url = f\"{base_url}\/{page}_p\"\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.info(f\"Scraping page {page}: {url}\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 content = fetch_data(url)\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 if content:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 data = parse_data(content)\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 if data:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 try:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 house_details = 
data['props']['pageProps']['searchPageState']['cat1']['searchResults']['listResults']\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 if house_details:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 save_to_csv(house_details, output_file,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 mode='a' if page &gt; 1 else 'w')\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.info(\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 f\"Data from page {page} has been saved to {output_file}\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.info(\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 f\"No more results found on page {page}. 
Stopping.\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 break\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 except KeyError as e:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.error(f\"KeyError on page {page}: {e}\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.error(f\"Data structure: {data}\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 break\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.error(\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 f\"Failed to parse data from page {page}. Stopping.\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 break\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.error(\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 f\"Failed to fetch data from page {page}. 
Stopping.\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 break\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 page += 1\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 pbar.update(1)\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 # Add a delay between requests to be respectful to the server\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 time.sleep(5)\r\n\r\n\u00a0 \u00a0 logging.info(\"Scraping completed.\")\r\n\r\n\r\nif __name__ == \"__main__\":\r\n\u00a0 \u00a0 main()<\/pre>\n<h2 id=\"part-2\"><span style=\"font-weight: 400;\">[PART 2 ] Scrape the other information from the Properties page<\/span><\/h2>\n<ul>\n<li>Year Built, Description, Listing Date, Days on Zillow, Total Views, Total Saved, Realtor Name, Realtor Contact Number, Agency, Co-realtor Name, Co-realtor contact number, Co-realtor agency<\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">To extract additional information from a Zillow property listing that is not available directly on the search results page, we need to send a GET request to the specific <code>HOUSE URL<\/code>. This will allow us to gather details such as the year built, description, listing updated date, realtor information, number of views, and number of saves.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">First, we will define the <code>HOUSE URL<\/code> from which we want to extract the additional information. 
This URL may vary depending on the specific property you are scraping.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">house_url = 'https:\/\/www.zillow.com\/homedetails\/7017-S-132nd-Ave-Omaha-NE-68138\/58586050_zpid\/'\r\n\r\nresponse = requests.get(house_url, headers=HEADERS, proxies=PROXIES)\r\nsoup = BeautifulSoup(response.content, 'html.parser')<\/pre>\n<p><span style=\"font-weight: 400;\">Since we already have the image URLs, we will focus on this container, which holds the relevant data for extraction.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1227\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/15-content_container.png\" alt=\"content container\" width=\"822\" height=\"751\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/15-content_container.png 822w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/15-content_container-300x274.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/15-content_container-768x702.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/15-content_container-624x570.png 624w\" sizes=\"auto, (max-width: 822px) 100vw, 822px\" \/><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">content = soup.find('div', class_='ds-data-view-list')<\/pre>\n<p><span style=\"font-weight: 400;\">Now let\u2019s extract the Year Built:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1228\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/16-built_year.png\" alt=\"year built element\" width=\"438\" height=\"158\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/16-built_year.png 438w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/16-built_year-300x108.png 300w\" sizes=\"auto, (max-width: 438px) 
100vw, 438px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Since there are a few other elements with the same span tag and class name, we\u2019ll be more specific and find the element whose text contains &#8220;Built in&#8221;.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">year = content.find('span', class_='Text-c11n-8-100-2__sc-aiai24-0', string=lambda text: text and \"Built in\" in text)\r\nyear_built = year.text.strip().replace('Built in ', '')\r\nyear_built<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1229\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/17-built_year_result.png\" alt=\"year built result\" width=\"820\" height=\"272\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/17-built_year_result.png 820w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/17-built_year_result-300x100.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/17-built_year_result-768x255.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/17-built_year_result-624x207.png 624w\" sizes=\"auto, (max-width: 820px) 100vw, 820px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">The property description can be found within a specific div tag identified by its <code>data-testid<\/code>.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1231\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/description_element.png\" alt=\"description elements\" width=\"771\" height=\"260\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/description_element.png 771w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/description_element-300x101.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/description_element-768x259.png 768w, 
https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/description_element-624x210.png 624w\" sizes=\"auto, (max-width: 771px) 100vw, 771px\" \/><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">description = content.find('div', attrs={'data-testid': 'description'}).text.strip()\r\ndescription<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1232\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/18-description_output.png\" alt=\"Description output\" width=\"860\" height=\"142\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/18-description_output.png 860w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/18-description_output-300x50.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/18-description_output-768x127.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/18-description_output-624x103.png 624w\" sizes=\"auto, (max-width: 860px) 100vw, 860px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Notice that the extracted text ends with a &#8216;Show more&#8217; string. 
Let\u2019s remove it by replacing it with an empty string.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">description = content.find('div', attrs={'data-testid': 'description'}).text.strip().replace('Show more','')<\/pre>\n<p><span style=\"font-weight: 400;\">Get the listing date:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1238\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/19-listing_date.png\" alt=\"listing date element\" width=\"418\" height=\"126\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/19-listing_date.png 418w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/19-listing_date-300x90.png 300w\" sizes=\"auto, (max-width: 418px) 100vw, 418px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Similar to extracting the year built, we will find the listing updated date using a specific class name and filtering for relevant text.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">listing_details = content.find_all('p', class_='Text-c11n-8-100-2__sc-aiai24-0', string=lambda text: text and \"Listing updated\" in text)\r\ndate_details = listing_details[0].text.strip()\r\ndate_part = date_details.split(' at ')[0]\r\nlisting_date = date_part.replace('Listing updated: ', '').strip()<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1242\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/20-listing_date_output.png\" alt=\"listing date output\" width=\"861\" height=\"402\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/20-listing_date_output.png 861w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/20-listing_date_output-300x140.png 300w, 
https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/20-listing_date_output-768x359.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/20-listing_date_output-624x291.png 624w\" sizes=\"auto, (max-width: 861px) 100vw, 861px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Get the days on Zillow, total views and total saved<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1243\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/21-dt_tag.png\" alt=\"dt tag\" width=\"522\" height=\"432\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/21-dt_tag.png 522w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/21-dt_tag-300x248.png 300w\" sizes=\"auto, (max-width: 522px) 100vw, 522px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">These values can be found within <code>dt<\/code> tags. We will extract them based on their positions.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">containers = content.find_all('dt')<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1244\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/22-dt_container.png\" alt=\"dt container\" width=\"863\" height=\"216\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/22-dt_container.png 863w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/22-dt_container-300x75.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/22-dt_container-768x192.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/22-dt_container-624x156.png 624w\" sizes=\"auto, (max-width: 863px) 100vw, 863px\" \/><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">days_on_zillow = containers[0].text.strip()\r\nviews = containers[2].text.strip()\r\ntotal_save = 
containers[4].text.strip()<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1247\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/23-dt_output.png\" alt=\"dt output\" width=\"487\" height=\"181\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/23-dt_output.png 487w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/23-dt_output-300x111.png 300w\" sizes=\"auto, (max-width: 487px) 100vw, 487px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Finally, we will extract information about the realtor and their agency from specific <code>p<\/code> tags.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1248\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/24-realtor.png\" alt=\"realtor element tag\" width=\"460\" height=\"229\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/24-realtor.png 460w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/24-realtor-300x149.png 300w\" sizes=\"auto, (max-width: 460px) 100vw, 460px\" \/> <img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1252\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/25-realtor_container.png\" alt=\"realtor container\" width=\"470\" height=\"188\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/25-realtor_container.png 470w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/25-realtor_container-300x120.png 300w\" sizes=\"auto, (max-width: 470px) 100vw, 470px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">If we expand the <code>p<\/code> tag, we can see the values that we want inside it.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">realtor_content = content.find('p', attrs={'data-testid': 
'attribution-LISTING_AGENT'}).text.strip().replace(',', '')\r\nprint('REALTOR:', realtor_content)<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1254\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/26-realtor_details_output.png\" alt=\"realtor output details\" width=\"594\" height=\"137\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/26-realtor_details_output.png 594w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/26-realtor_details_output-300x69.png 300w\" sizes=\"auto, (max-width: 594px) 100vw, 594px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">As the output above shows, the realtor\u2019s name and contact number are inside the same element, so let\u2019s separate them to keep our data clean.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">name, contact = realtor_content.split('M:')\r\nrealtor_name = name.strip()\r\nrealtor_contact = contact.strip()\r\nprint('REALTOR NAME:', realtor_name)\r\nprint('REALTOR CONTACT NO:', realtor_contact)<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1256\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/27-realtor_separate_output.png\" alt=\"realtor output separate\" width=\"466\" height=\"131\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/27-realtor_separate_output.png 466w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/27-realtor_separate_output-300x84.png 300w\" sizes=\"auto, (max-width: 466px) 100vw, 466px\" \/><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">agency_name = content.find('p', attrs={'data-testid': 'attribution-BROKER'}).text.strip().replace(',', '')\r\nprint('OFFICE:', agency_name)<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full 
wp-image-1257\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/28-agency.png\" alt=\"agency name\" width=\"536\" height=\"132\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/28-agency.png 536w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/28-agency-300x74.png 300w\" sizes=\"auto, (max-width: 536px) 100vw, 536px\" \/><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">co_realtor_content = content.find('p', attrs={'data-testid': 'attribution-CO_LISTING_AGENT'}).text.strip().replace(',', '') print('CO-REALTOR CONTENT:', co_realtor_content)<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1260\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/29-co_realtor.png\" alt=\"\" width=\"580\" height=\"136\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/29-co_realtor.png 580w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/29-co_realtor-300x70.png 300w\" sizes=\"auto, (max-width: 580px) 100vw, 580px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Same as before we need to split the name and contact number.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">name_contact = co_realtor_content.rsplit(' ', 1)\r\nname = name_contact[0]\r\ncontact = name_contact[1]\r\nco_realtor_name = name.strip()\r\nco_realtor_contact = contact.strip()\r\nprint(f\"CO-REALTOR NAME: {co_realtor_name}\")\r\nprint(f\"CO-REALTOR CONTACT NO: {co_realtor_contact}\")<\/pre>\n<h2><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1261\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/30-co_realtor_output.png\" alt=\"co-realtor separate output\" width=\"652\" height=\"132\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/30-co_realtor_output.png 652w, 
https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/30-co_realtor_output-300x61.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/30-co_realtor_output-624x126.png 624w\" sizes=\"auto, (max-width: 652px) 100vw, 652px\" \/><\/h2>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">co_realtor_agency_name = content.find('p', attrs={'data-testid': 'attribution-CO_LISTING_AGENT_OFFICE'}).text.strip() print('CO-REALTOR AGENCY NAME:', co_realtor_agency_name)<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1264\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/31-co_realtor_agec-y.png\" alt=\"co-realtor agency\" width=\"601\" height=\"132\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/31-co_realtor_agec-y.png 601w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/31-co_realtor_agec-y-300x66.png 300w\" sizes=\"auto, (max-width: 601px) 100vw, 601px\" \/><\/p>\n<h2 id=\"complete-code-additional-data\"><span style=\"font-weight: 400;\">Complete code with the additional data<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Let\u2019s enhance our data collection process by<strong> creating a new Python file<\/strong> dedicated to fetching additional information. This script will first read the <code>HOUSE URLs<\/code> from the existing CSV file, sending requests for each URL to extract valuable data. 
Once all information is gathered, it will save the results in a new CSV file, preserving the original data for reference.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import os\r\nimport requests\r\nfrom bs4 import BeautifulSoup\r\nimport json\r\nimport csv\r\nimport time\r\nimport logging\r\nfrom tqdm import tqdm\r\nfrom dotenv import load_dotenv\r\n\r\nload_dotenv()\r\n\r\n# Set up logging\r\nlogging.basicConfig(filename='scraper.log', level=logging.INFO,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 format='%(asctime)s - %(levelname)s - %(message)s')\r\n\r\n# Define headers for the HTTP request\r\nHEADERS = {\r\n\u00a0 \u00a0 \"User-Agent\": 'Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/129.0.0.0 Safari\/537.36',\r\n\u00a0 \u00a0 \"Accept\": \"text\/html,application\/xhtml+xml,application\/xml;q=0.9,image\/webp,*\/*;q=0.8\",\r\n\u00a0 \u00a0 \"Accept-Language\": \"en-US,en;q=0.5\",\r\n\u00a0 \u00a0 \"Accept-Encoding\": \"gzip, deflate, br\",\r\n\u00a0 \u00a0 \"DNT\": \"1\",\r\n\u00a0 \u00a0 \"Connection\": \"keep-alive\",\r\n\u00a0 \u00a0 \"Upgrade-Insecure-Requests\": \"1\",\r\n}\r\n\r\n# Define proxy settings (if needed)\r\nproxy = os.getenv(\"PROXY\")\r\n\r\nPROXIES = {\r\n\u00a0 \u00a0 'http': f'http:\/\/{proxy}',\r\n\u00a0 \u00a0 'https': f'http:\/\/{proxy}'\r\n}\r\n\r\n\r\ndef fetch_data(url):\r\n\u00a0 \u00a0 try:\r\n\u00a0 \u00a0 \u00a0 \u00a0 response = requests.get(url, headers=HEADERS, proxies=PROXIES)\r\n\u00a0 \u00a0 \u00a0 \u00a0 response.raise_for_status()\r\n\u00a0 \u00a0 \u00a0 \u00a0 return response.content\r\n\u00a0 \u00a0 except requests.RequestException as e:\r\n\u00a0 \u00a0 \u00a0 \u00a0 logging.error(f\"Error fetching data: {e}\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 return None\r\n\r\n\r\ndef parse_data(content):\r\n\u00a0 \u00a0 try:\r\n\u00a0 \u00a0 \u00a0 \u00a0 soup = BeautifulSoup(content, 'html.parser')\r\n\u00a0 \u00a0 \u00a0 
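\u00a0 # Zillow is built with Next.js, which embeds the full page state as\r\n\u00a0 \u00a0 \u00a0 \u00a0 # JSON inside a &lt;script id='__NEXT_DATA__'&gt; tag; parsing that JSON\r\n\u00a0 \u00a0 \u00a0 \u00a0 # is more reliable than scraping individual HTML elements\r\n\u00a0 \u00a0 \u00a0 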
\u00a0 script_content = soup.find('script', id='__NEXT_DATA__')\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 if script_content:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 json_content = script_content.string\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 return json.loads(json_content)\r\n\u00a0 \u00a0 \u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.error(\"Could not find the required script tag.\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 return None\r\n\u00a0 \u00a0 except json.JSONDecodeError as e:\r\n\u00a0 \u00a0 \u00a0 \u00a0 logging.error(f\"Error parsing JSON: {e}\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 return None\r\n\r\n\r\ndef save_to_csv(house_details, mode='a'):\r\n\u00a0 \u00a0 output_directory = 'OUTPUT_1'\r\n\u00a0 \u00a0 os.makedirs(output_directory, exist_ok=True)\r\n\u00a0 \u00a0 file_name = 'house_details-1-5.csv'\u00a0 # Change accordingly\r\n\u00a0 \u00a0 output_file = os.path.join(output_directory, file_name)\r\n\r\n\u00a0 \u00a0 with open(output_file, mode, newline='', encoding='utf-8') as csvfile:\r\n\u00a0 \u00a0 \u00a0 \u00a0 csvwriter = csv.writer(csvfile)\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 if mode == 'w':\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 csvwriter.writerow([\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'HOUSE URL', 'PHOTO URLs', 'PRICE', 'FULL ADDRESS',\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'STREET', 'CITY', 'STATE', 'ZIP CODE',\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'NUMBER OF BEDROOMS', 'NUMBER OF BATHROOMS',\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'HOUSE SIZE', 'LOT SIZE', 'HOUSE TYPE'\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 ])\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 for detail in tqdm(house_details, desc=\"Saving house details\", unit=\"house\"):\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 try:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 home_info = detail.get('hdpData', {}).get('homeInfo', {})\r\n\u00a0 \u00a0 \u00a0 
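\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 # Join all carousel photo URLs into a single comma-separated CSV cell\r\n\u00a0 \u00a0 \u00a0 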
\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 photo_urls = ','.join([photo.get('url', '')\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 for photo in detail.get('carouselPhotos', [])])\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 # Concatenate lot area value and unit\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 lot_size = f\"{home_info.get('lotAreaValue', '')} {home_info.get('lotAreaUnit', '')}\"\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 csvwriter.writerow([\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail.get('detailUrl', ''),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 photo_urls,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail.get('price', ''),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail.get('address', ''),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail.get('addressStreet', ''),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail.get('addressCity', ''),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail.get('addressState', ''),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 detail.get('addressZipcode', ''),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 home_info.get('bedrooms', ''),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 home_info.get('bathrooms', ''),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 home_info.get('livingArea', ''),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 lot_size,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 home_info.get('homeType', '').replace('_', ' ')\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 ])\r\n\u00a0 \u00a0 
\u00a0 \u00a0 \u00a0 \u00a0 except Exception as e:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.error(f\"Error processing house detail: {e}\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.error(f\"Problematic detail: {detail}\")\r\n\r\n\r\ndef main():\r\n\u00a0 \u00a0 base_url = \"https:\/\/www.zillow.com\/ne\"\r\n\u00a0 \u00a0 page = 1\r\n\u00a0 \u00a0 max_pages = 5\u00a0 # Set this to the number of pages you want to scrape, or None for all pages\r\n\r\n\u00a0 \u00a0 with tqdm(total=max_pages, desc=\"Scraping pages\", unit=\"page\") as pbar:\r\n\u00a0 \u00a0 \u00a0 \u00a0 while max_pages is None or page &lt;= max_pages:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 if page == 1:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 url = base_url\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 url = f\"{base_url}\/{page}_p\"\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.info(f\"Scraping page {page}: {url}\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 content = fetch_data(url)\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 if content:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 data = parse_data(content)\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 if data:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 try:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 house_details = data['props']['pageProps']['searchPageState']['cat1']['searchResults']['listResults']\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 if house_details:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 save_to_csv(house_details,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 mode='a' if page &gt; 1 
else 'w')\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.info(\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 f\"Data from page {page} has been saved to house_details-1-5.csv\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.info(\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 f\"No more results found on page {page}. Stopping.\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 break\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 except KeyError as e:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.error(f\"KeyError on page {page}: {e}\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.error(f\"Data structure: {data}\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 break\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.error(\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 f\"Failed to parse data from page {page}. Stopping.\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 break\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.error(\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 f\"Failed to fetch data from page {page}. 
Stopping.\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 break\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 page += 1\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 pbar.update(1)\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 # Add a delay between requests to be respectful to the server\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 time.sleep(5)\r\n\r\n\u00a0 \u00a0 logging.info(\"Scraping completed.\")\r\n\r\n\r\nif __name__ == \"__main__\":\r\n\u00a0 \u00a0 main()<\/pre>\n<p><span style=\"font-weight: 400;\">Why Create a New File?<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The decision to generate a new file instead of overwriting the previous one serves as a safeguard. This approach ensures that we have a backup in case our code encounters issues or if access is blocked, allowing us to maintain data integrity throughout the process.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">By implementing this strategy, we not only enhance our data collection capabilities but also ensure that we can troubleshoot effectively without losing any valuable information.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">.<\/span><\/p>\n<h2 id=\"complete-code-proxy-rotation\"><span style=\"font-weight: 400;\">Complete code for the additional data with Proxy Rotation<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Implementing proxy rotation is essential for avoiding anti-bot detection, especially when making numerous requests to a website. 
In this tutorial, we will demonstrate how to gather additional data from Zillow property listings while utilizing proxies from <\/span><a href=\"https:\/\/rayobyte.com\/products\/residential-proxies\/\"><span style=\"font-weight: 400;\">Rayobyte<\/span><\/a><span style=\"font-weight: 400;\">, which offers 50MB of residential proxy traffic for free upon signup.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Download and Prepare the Proxy List<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Sign Up for Rayobyte: Create an account on Rayobyte to access their proxy services.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Generate Proxy List:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Navigate to the &#8220;<strong>Proxy List Generator<\/strong>&#8221; in your dashboard.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Set the format to <strong>username:password@hostname:port<\/strong>.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Download the proxy list.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">Move the Proxy File: Locate the downloaded file in your downloads directory and move it to your code directory.<\/span><\/p>\n<h3><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-1265\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/32-rayobyte_dashboard-1024x649.png\" alt=\"rayobyte dashboard\" width=\"640\" height=\"406\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/32-rayobyte_dashboard-1024x649.png 1024w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/32-rayobyte_dashboard-300x190.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/32-rayobyte_dashboard-768x487.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/32-rayobyte_dashboard-624x396.png 624w, 
https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/32-rayobyte_dashboard.png 1194w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/h3>\n<h3><span style=\"font-weight: 400;\">Implement Proxy Rotation in Your Code<\/span><\/h3>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import os\r\nimport requests\r\nfrom bs4 import BeautifulSoup\r\nfrom dotenv import load_dotenv\r\nimport pandas as pd\r\nimport random\r\nimport time\r\nimport logging\r\nfrom tqdm import tqdm\r\n\r\nload_dotenv()\r\n\r\n# Set up logging\r\nlogging.basicConfig(filename='scraper.log', level=logging.INFO,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 format='%(asctime)s - %(levelname)s - %(message)s',\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 datefmt='%Y-%m-%d %H:%M:%S')\r\n\r\nHEADERS = {\r\n\u00a0 \u00a0 \"User-Agent\": 'Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/129.0.0.0 Safari\/537.36',\r\n\u00a0 \u00a0 \"Accept\": \"text\/html,application\/xhtml+xml,application\/xml;q=0.9,image\/webp,*\/*;q=0.8\",\r\n\u00a0 \u00a0 \"Accept-Language\": \"en-US,en;q=0.5\",\r\n\u00a0 \u00a0 \"Accept-Encoding\": \"gzip, deflate, br\",\r\n\u00a0 \u00a0 \"DNT\": \"1\",\r\n\u00a0 \u00a0 \"Connection\": \"keep-alive\",\r\n\u00a0 \u00a0 \"Upgrade-Insecure-Requests\": \"1\",\r\n}\r\n\r\n\r\ndef load_proxies(file_path):\r\n\u00a0 \u00a0 with open(file_path, 'r') as f:\r\n\u00a0 \u00a0 \u00a0 \u00a0 return [line.strip() for line in f if line.strip()]\r\n\r\n\r\nPROXY_LIST = load_proxies('proxy-list.txt')\r\n\r\n\r\ndef get_random_proxy():\r\n\u00a0 \u00a0 return random.choice(PROXY_LIST)\r\n\r\n\r\ndef get_proxies(proxy):\r\n\u00a0 \u00a0 return {\r\n\u00a0 \u00a0 \u00a0 \u00a0 'http': f'http:\/\/{proxy}',\r\n\u00a0 \u00a0 \u00a0 \u00a0 'https': f'http:\/\/{proxy}'\r\n\u00a0 \u00a0 }\r\n\r\n\r\ndef scrape_with_retry(url, max_retries=3):\r\n\u00a0 \u00a0 for attempt in 
range(max_retries):\r\n\u00a0 \u00a0 \u00a0 \u00a0 proxy = get_random_proxy()\r\n\u00a0 \u00a0 \u00a0 \u00a0 proxies = get_proxies(proxy)\r\n\u00a0 \u00a0 \u00a0 \u00a0 try:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 response = requests.get(\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 url, headers=HEADERS, proxies=proxies, timeout=30)\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 if response.status_code == 200:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 return response\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.warning(\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 f\"Attempt {attempt + 1} failed with status code {response.status_code} for URL: {url}\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 except requests.RequestException as e:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 logging.error(\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 f\"Attempt {attempt + 1} failed with error: {e} for URL: {url}\")\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 time.sleep(random.uniform(1, 3))\r\n\r\n\u00a0 \u00a0 logging.error(\r\n\u00a0 \u00a0 \u00a0 \u00a0 f\"Failed to fetch data for {url} after {max_retries} attempts.\")\r\n\u00a0 \u00a0 return None\r\n\r\n\r\ndef scrape_house_data(house_url):\r\n\u00a0 \u00a0 response = scrape_with_retry(house_url)\r\n\u00a0 \u00a0 if not response:\r\n\u00a0 \u00a0 \u00a0 \u00a0 return None\r\n\r\n\u00a0 \u00a0 soup = BeautifulSoup(response.content, 'html.parser')\r\n\u00a0 \u00a0 content = soup.find('div', class_='ds-data-view-list')\r\n\r\n\u00a0 \u00a0 if not content:\r\n\u00a0 \u00a0 \u00a0 \u00a0 logging.error(f\"Failed to find content for {house_url}\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 return None\r\n\r\n\u00a0 \u00a0 year = content.find('span', class_='Text-c11n-8-100-2__sc-aiai24-0',\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 string=lambda text: text and \"Built in\" in 
text)\r\n\u00a0 \u00a0 year_built = year.text.strip().replace('Built in ', '') if year else \"N\/A\"\r\n\r\n\u00a0 \u00a0 description_elem = content.find(\r\n\u00a0 \u00a0 \u00a0 \u00a0 'div', attrs={'data-testid': 'description'})\r\n\u00a0 \u00a0 description = description_elem.text.strip().replace(\r\n\u00a0 \u00a0 \u00a0 \u00a0 'Show more', '') if description_elem else \"N\/A\"\r\n\r\n\u00a0 \u00a0 listing_details = content.find_all(\r\n\u00a0 \u00a0 \u00a0 \u00a0 'p', class_='Text-c11n-8-100-2__sc-aiai24-0', string=lambda text: text and \"Listing updated\" in text)\r\n\u00a0 \u00a0 listing_date = \"N\/A\"\r\n\u00a0 \u00a0 if listing_details:\r\n\u00a0 \u00a0 \u00a0 \u00a0 date_details = listing_details[0].text.strip()\r\n\u00a0 \u00a0 \u00a0 \u00a0 date_part = date_details.split(' at ')[0]\r\n\u00a0 \u00a0 \u00a0 \u00a0 listing_date = date_part.replace('Listing updated: ', '').strip()\r\n\r\n\u00a0 \u00a0 containers = content.find_all('dt')\r\n\u00a0 \u00a0 days_on_zillow = containers[0].text.strip() if len(\r\n\u00a0 \u00a0 \u00a0 \u00a0 containers) &gt; 0 else \"N\/A\"\r\n\u00a0 \u00a0 views = containers[2].text.strip() if len(containers) &gt; 2 else \"N\/A\"\r\n\u00a0 \u00a0 total_save = containers[4].text.strip() if len(containers) &gt; 4 else \"N\/A\"\r\n\r\n\u00a0 \u00a0 realtor_elem = content.find(\r\n\u00a0 \u00a0 \u00a0 \u00a0 'p', attrs={'data-testid': 'attribution-LISTING_AGENT'})\r\n\u00a0 \u00a0 if realtor_elem:\r\n\u00a0 \u00a0 \u00a0 \u00a0 realtor_content = realtor_elem.text.strip().replace(',', '')\r\n\u00a0 \u00a0 \u00a0 \u00a0 if 'M:' in realtor_content:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 name, contact = realtor_content.split('M:')\r\n\u00a0 \u00a0 \u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 name_contact = realtor_content.rsplit(' ', 1)\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 name = name_contact[0]\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 contact = name_contact[1]\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 
realtor_name = name.strip()\r\n\u00a0 \u00a0 \u00a0 \u00a0 realtor_contact = contact.strip()\r\n\r\n\u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 realtor_name = \"N\/A\"\r\n\u00a0 \u00a0 \u00a0 \u00a0 realtor_contact = \"N\/A\"\r\n\r\n\u00a0 \u00a0 agency_elem = content.find(\r\n\u00a0 \u00a0 \u00a0 \u00a0 'p', attrs={'data-testid': 'attribution-BROKER'})\r\n\u00a0 \u00a0 agency_name = agency_elem.text.strip().replace(',', '') if agency_elem else \"N\/A\"\r\n\r\n\u00a0 \u00a0 co_realtor_elem = content.find(\r\n\u00a0 \u00a0 \u00a0 \u00a0 'p', attrs={'data-testid': 'attribution-CO_LISTING_AGENT'})\r\n\u00a0 \u00a0 if co_realtor_elem:\r\n\u00a0 \u00a0 \u00a0 \u00a0 co_realtor_content = co_realtor_elem.text.strip().replace(',', '')\r\n\u00a0 \u00a0 \u00a0 \u00a0 if 'M:' in co_realtor_content:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 name, contact = co_realtor_content.split('M:')\r\n\u00a0 \u00a0 \u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 name_contact = co_realtor_content.rsplit(' ', 1)\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 name = name_contact[0]\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 contact = name_contact[1]\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 co_realtor_name = name.strip()\r\n\u00a0 \u00a0 \u00a0 \u00a0 co_realtor_contact = contact.strip()\r\n\r\n\u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 co_realtor_name = \"N\/A\"\r\n\u00a0 \u00a0 \u00a0 \u00a0 co_realtor_contact = \"N\/A\"\r\n\r\n\u00a0 \u00a0 co_realtor_agency_elem = content.find(\r\n\u00a0 \u00a0 \u00a0 \u00a0 'p', attrs={'data-testid': 'attribution-CO_LISTING_AGENT_OFFICE'})\r\n\u00a0 \u00a0 co_realtor_agency_name = co_realtor_agency_elem.text.strip(\r\n\u00a0 \u00a0 ) if co_realtor_agency_elem else \"N\/A\"\r\n\r\n\u00a0 \u00a0 return {\r\n\u00a0 \u00a0 \u00a0 \u00a0 'YEAR BUILT': year_built,\r\n\u00a0 \u00a0 \u00a0 \u00a0 'DESCRIPTION': description,\r\n\u00a0 \u00a0 \u00a0 \u00a0 'LISTING DATE': listing_date,\r\n\u00a0 \u00a0 \u00a0 \u00a0 'DAYS ON ZILLOW': 
days_on_zillow,\r\n\u00a0 \u00a0 \u00a0 \u00a0 'TOTAL VIEWS': views,\r\n\u00a0 \u00a0 \u00a0 \u00a0 'TOTAL SAVED': total_save,\r\n\u00a0 \u00a0 \u00a0 \u00a0 'REALTOR NAME': realtor_name,\r\n\u00a0 \u00a0 \u00a0 \u00a0 'REALTOR CONTACT NO': realtor_contact,\r\n\u00a0 \u00a0 \u00a0 \u00a0 'AGENCY': agency_name,\r\n\u00a0 \u00a0 \u00a0 \u00a0 'CO-REALTOR NAME': co_realtor_name,\r\n\u00a0 \u00a0 \u00a0 \u00a0 'CO-REALTOR CONTACT NO': co_realtor_contact,\r\n\u00a0 \u00a0 \u00a0 \u00a0 'CO-REALTOR AGENCY': co_realtor_agency_name\r\n\u00a0 \u00a0 }\r\n\r\n\r\ndef ensure_output_directory(directory):\r\n\u00a0 \u00a0 if not os.path.exists(directory):\r\n\u00a0 \u00a0 \u00a0 \u00a0 os.makedirs(directory)\r\n\u00a0 \u00a0 \u00a0 \u00a0 logging.info(f\"Created output directory: {directory}\")\r\n\r\n\r\ndef load_progress(output_file):\r\n\u00a0 \u00a0 if os.path.exists(output_file):\r\n\u00a0 \u00a0 \u00a0 \u00a0 return pd.read_csv(output_file)\r\n\u00a0 \u00a0 return pd.DataFrame()\r\n\r\n\r\ndef save_progress(df, output_file):\r\n\u00a0 \u00a0 df.to_csv(output_file, index=False)\r\n\u00a0 \u00a0 logging.info(f\"Progress saved to {output_file}\")\r\n\r\n\r\ndef main():\r\n\u00a0 \u00a0 input_file = '.\/OUTPUT_1\/house_details.csv'\r\n\r\n\u00a0 \u00a0 output_directory = 'OUTPUT_2'\r\n\u00a0 \u00a0 file_name = 'house_details_scraped.csv'\r\n\u00a0 \u00a0 output_file = os.path.join(output_directory, file_name)\r\n\u00a0 \u00a0 ensure_output_directory(output_directory)\r\n\r\n\u00a0 \u00a0 df = pd.read_csv(input_file)\r\n\r\n\u00a0 \u00a0 # Load existing progress\r\n\u00a0 \u00a0 result_df = load_progress(output_file)\r\n\r\n\u00a0 \u00a0 # Determine which URLs have already been scraped\r\n\u00a0 \u00a0 scraped_urls = set(result_df['HOUSE URL']\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 ) if 'HOUSE URL' in result_df.columns else set()\r\n\r\n\u00a0 \u00a0 # Scrape data for each house URL\r\n\u00a0 \u00a0 for _, row in tqdm(df.iterrows(), 
total=df.shape[0], desc=\"Scraping Progress\"):\r\n\u00a0 \u00a0 \u00a0 \u00a0 house_url = row['HOUSE URL']\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 # Skip if already scraped\r\n\u00a0 \u00a0 \u00a0 \u00a0 if house_url in scraped_urls:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 continue\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 logging.info(f\"Scraping data for {house_url}\")\r\n\u00a0 \u00a0 \u00a0 \u00a0 data = scrape_house_data(house_url)\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 if data:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 # Combine the original row data with the scraped data\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 combined_data = {**row.to_dict(), **data}\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 new_row = pd.DataFrame([combined_data])\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 # Append the new row to the result DataFrame\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 result_df = pd.concat([result_df, new_row], ignore_index=True)\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 # Save progress after each successful scrape\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 save_progress(result_df, output_file)\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 # Add a random delay between requests (1 to 5 seconds)\r\n\u00a0 \u00a0 \u00a0 \u00a0 time.sleep(random.uniform(1, 5))\r\n\r\n\u00a0 \u00a0 logging.info(f\"Scraping completed. Final results saved to {output_file}\")\r\n\u00a0 \u00a0 print(\r\n\u00a0 \u00a0 \u00a0 \u00a0 f\"Scraping completed. Check {output_file} for results and scraper.log for detailed logs.\")\r\n\r\n\r\nif __name__ == \"__main__\":\r\n\u00a0 \u00a0 main()<\/pre>\n<h2 id=\"conclusions\"><span style=\"font-weight: 400;\">Conclusion<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">In conclusion, this comprehensive guide on Zillow scraping with Python has equipped you with essential tools and techniques to effectively extract property listings and home prices. 
By following the outlined steps, you have learned how to navigate the complexities of web scraping, including overcoming anti-bot measures and utilizing proxies for seamless data retrieval.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Key takeaways from this tutorial include:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\"><b>Understanding the Ethical Considerations<\/b><span style=\"font-weight: 400;\">: Emphasizing responsible scraping practices to respect website performance and legal guidelines.<\/span><\/li>\n<li style=\"font-weight: 400;\"><b>Scraping Workflow<\/b><span style=\"font-weight: 400;\">: Dividing the scraping process into manageable parts for clarity and efficiency.<\/span><\/li>\n<li style=\"font-weight: 400;\"><b>Technical Implementation<\/b><span style=\"font-weight: 400;\">: Utilizing Python libraries such as requests, BeautifulSoup, and json for data extraction.<\/span><\/li>\n<li style=\"font-weight: 400;\"><b>Data Storage<\/b><span style=\"font-weight: 400;\">: Saving extracted information in CSV format for easy access and analysis.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">As you implement these strategies, you will gain valuable insights into real estate trends and market dynamics, empowering you to make informed decisions based on the data collected. With the provided source code and detailed explanations, you are now well-prepared to adapt this project to your specific needs, whether that involves expanding your data collection or refining your analysis techniques. Embrace the power of data-driven insights as you explore the vast landscape of real estate information available through platforms like Zillow. 
Drop a comment below if you have any questions and Happy scraping!<\/span><\/p>\n<p>Source code:\u00a0<a href=\"https:\/\/github.com\/ainacodes\/zillow_properties_for_sale_scraper\" target=\"_blank\" rel=\"nofollow noopener\">zillow_properties_for_sale_scraper<\/a><\/p>\n<p>Video: <a href=\"https:\/\/www.youtube.com\/watch?v=CyHNw0xwp8E\" rel=\"nofollow noopener\" target=\"_blank\">Extract data from Zillow properties for sale listing using Python<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Zillow Scraping with Python: Extract Property Listings and Home Prices Source code: zillow_properties_for_sale_scraper\u00a0 Table of Content Introduction Ethical Consideration Scraping Workflow Prerequisites Project Setup [PART&hellip;<\/p>\n","protected":false},"author":25,"featured_media":1266,"comment_status":"open","ping_status":"closed","template":"","meta":{"rank_math_lock_modified_date":false},"categories":[],"class_list":["post-1206","scraping_project","type-scraping_project","status-publish","has-post-thumbnail","hentry"],"_links":{"self":[{"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/scraping_project\/1206","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/scraping_project"}],"about":[{"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/types\/scraping_project"}],"author":[{"embeddable":true,"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/users\/25"}],"replies":[{"embeddable":true,"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/comments?post=1206"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/media\/1266"}],"wp:attachment":[{"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/media?parent=1206"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/categories?post=1206"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","t
emplated":true}]}}