{"id":1285,"date":"2024-10-28T07:57:50","date_gmt":"2024-10-28T07:57:50","guid":{"rendered":"https:\/\/rayobyte.com\/community\/?post_type=scraping_project&#038;p=1285"},"modified":"2024-10-30T17:16:23","modified_gmt":"2024-10-30T17:16:23","slug":"airbnb-web-scraping-with-python-extract-listings-and-pricing-data","status":"publish","type":"scraping_project","link":"https:\/\/rayobyte.com\/community\/scraping-project\/airbnb-web-scraping-with-python-extract-listings-and-pricing-data\/","title":{"rendered":"Airbnb Web Scraping with Python: Extract Listings and Pricing Data"},"content":{"rendered":"<p><iframe loading=\"lazy\" title=\"Airbnb Web Scraping with Python\" width=\"640\" height=\"360\" src=\"https:\/\/www.youtube.com\/embed\/R2YoMnqJ2rg?feature=oembed&#038;enablejsapi=1&#038;origin=https:\/\/rayobyte.com\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe><\/p>\n<p><a href=\"https:\/\/github.com\/MDFARHYN\/airbnbScraping\" rel=\"nofollow noopener\" target=\"_blank\">Download the full code from GitHub.<\/a><\/p>\n<h1>Table of Contents<\/h1>\n<ul>\n<li><a href=\"#introduction\">Introduction<\/a><\/li>\n<li><a href=\"#Prerequisites\">Prerequisites<\/a><\/li>\n<li><a href=\"#Understanding Airbnb\u2019s Structure for scraping\">Understanding Airbnb\u2019s Structure for scraping<\/a><\/li>\n<li><a href=\"#Get Dynamic Content from Listing Pages\">Get Dynamic Content from Listing Pages<\/a><\/li>\n<li><a href=\"#Using Regex\">Using Regex<\/a><\/li>\n<li><a href=\"#Handling Pagination\">Handling Pagination<\/a><\/li>\n<li><a href=\"#Scrape Detail Page\">Scrape Detail Page<\/a><\/li>\n<li><a href=\"#Saving data to csv\">Saving data to csv<\/a><\/li>\n<li><a href=\"#Techniques to Prevent Getting Blocked\">Techniques to Prevent Getting Blocked<\/a><\/li>\n<li><a href=\"#Legal and Ethical Issues\">Legal and Ethical 
Issues<\/a><\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-1288 size-full aligncenter\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/airbnb-scraping-with-python-web-scraping-guide.png\" alt=\"\" width=\"1024\" height=\"1024\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/airbnb-scraping-with-python-web-scraping-guide.png 1024w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/airbnb-scraping-with-python-web-scraping-guide-300x300.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/airbnb-scraping-with-python-web-scraping-guide-150x150.png 150w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/airbnb-scraping-with-python-web-scraping-guide-768x768.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/airbnb-scraping-with-python-web-scraping-guide-624x624.png 624w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<h1 id=\"introduction\"><b>Introduction<\/b><\/h1>\n<p><span style=\"font-weight: 400;\">By scraping Airbnb, businesses and researchers can learn more about rental trends, consumer preferences, and how pricing dynamics work. This rich data offers real benefits for competitive analysis, location-based investment decisions, and understanding seasonal demand shifts. It must, however, be gathered lawfully and ethically, in line with Airbnb\u2019s terms of service and privacy policies.<\/span><\/p>\n<h1 id=\"Prerequisites\"><b>Prerequisites<\/b><\/h1>\n<p><span style=\"font-weight: 400;\">To start scraping Airbnb data, we&#8217;ll use Python with key libraries that make data extraction and management more efficient. Here\u2019s a quick overview of the tools and setup steps:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\"><b>Python<\/b><span style=\"font-weight: 400;\">: Ensure Python (preferably 3.10 or newer) is installed. 
You can download it from<\/span><a href=\"https:\/\/www.python.org\/\" rel=\"nofollow noopener\" target=\"_blank\"> <span style=\"font-weight: 400;\">Python\u2019s official site<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/li>\n<\/ol>\n<ol start=\"2\">\n<li><b>Regex:<\/b><span style=\"font-weight: 400;\"> Python&#8217;s built-in regex module (re) lets you extract targeted data by matching patterns in the HTML. It is fast, copes well with inconsistent structures, and makes it easy to pick out elements such as prices, descriptions, and locations. Because it is part of the Python standard library, no additional package needs to be installed.<\/span><\/li>\n<li><b>Selenium-stealth: <\/b><span style=\"font-weight: 400;\">Some websites, including Airbnb, deploy bot protection and cannot be scraped directly. selenium-stealth helps bypass this detection by mimicking human-like browsing behavior. Install with:<\/span><\/li>\n<\/ol>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">pip install selenium-stealth<\/pre>\n<ol start=\"4\">\n<li><b>Pandas<\/b><span style=\"font-weight: 400;\">: This library helps manage, analyze, and store data in a structured way (like CSV files). Install with:<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">pip install pandas<\/pre>\n<p>Once you have these libraries installed, you&#8217;re ready to begin data extraction. These tools will enable a more flexible and robust scraping workflow. 
Remember, before collecting data, always confirm that your usage aligns with Airbnb&#8217;s terms of service and guidelines.<\/li>\n<\/ol>\n<h1 id=\"Understanding Airbnb\u2019s Structure for scraping\"><b>Understanding Airbnb\u2019s Structure for Scraping<\/b><\/h1>\n<p><span style=\"font-weight: 400;\">To scrape Airbnb data, you need to understand the structure of an Airbnb listing page so that you can identify important components such as property information, prices, and location coordinates. Here is a guide to identifying and extracting this data properly:<\/span><\/p>\n<p><b>Identifying Key Data Elements<\/b><\/p>\n<p><b>Property Details: <\/b><span style=\"font-weight: 400;\">Main characteristics such as the property title, type (apartment\/house), number of rooms, and additional attributes. Open an Airbnb listing page and inspect its HTML structure with your browser&#8217;s developer tools (right-click &gt; Inspect) or by pressing F12. Identify the classes or HTML tags that consistently contain these details.<\/span><\/p>\n<p><b>Pricing:<\/b><span style=\"font-weight: 400;\"> Pricing information, such as per-night rates and fees, is normally shown in specific tags (e.g. `<\/span><span style=\"font-weight: 400;\">&lt;span&gt;`<\/span><span style=\"font-weight: 400;\"> or `<\/span><span style=\"font-weight: 400;\">&lt;div&gt;`<\/span><span style=\"font-weight: 400;\">).<\/span><\/p>\n<p><b>Location: <\/b><span style=\"font-weight: 400;\">Most listings include an approximate location, such as the city, neighborhood, or distance to nearby landmarks, instead of a precise address. 
You can find this information in meta tags or descriptive fields inside the page.<\/span><\/p>\n<p><b>Review and Rating Details:<\/b><span style=\"font-weight: 400;\"> Airbnb listings also provide detailed breakdowns of user ratings across different categories such as cleanliness, communication, check-in, location, accuracy, and value.<\/span><\/p>\n<p><b>Handling Pagination<\/b><\/p>\n<p><b>Pagination structure:<\/b><span style=\"font-weight: 400;\"> Airbnb search pages generally include pagination controls at the bottom, often \u201cNext\u201d links or direct page numbers, that let you step through the results.<\/span><\/p>\n<p><b>Automate Page Navigation:<\/b><span style=\"font-weight: 400;\"> You can automate clicking the \u201cNext\u201d button on each page using Selenium. Alternatively, if the page URL uses known pagination parameters (e.g. page=2), you can simply adjust those parameters in your code and fetch listings in batches.<\/span><\/p>\n<p><b>Pull Data from Multiple Listings<\/b><\/p>\n<p><b>Iterating Over Listings: <\/b><span style=\"font-weight: 400;\">Once we have located the DOM elements representing the listings on a page, we can iterate over them.<\/span><\/p>\n<p><b>Storing and Structuring the Data: <\/b><span style=\"font-weight: 400;\">Store the collected data in a pandas DataFrame, which lets you save it as a CSV file or analyze it in a table-like format.<\/span><\/p>\n<p><b>Keep an Eye on Dynamic Content: <\/b><span style=\"font-weight: 400;\">Airbnb changes its website structure often, so your script has to be updated as new elements or layouts appear.<\/span><\/p>\n<h1 id=\"Get Dynamic Content from Listing Pages\"><b>Fetching Data: Using the selenium-stealth Library to Get Dynamic Content from Listing Pages<\/b><\/h1>\n<p><span style=\"font-weight: 400;\">When scraping Airbnb listings, a lot of the important information\u2014like prices or 
availability\u2014may not be immediately available in the static HTML. This is because it&#8217;s often loaded dynamically via JavaScript after the page fully loads. To deal with this, we can use the <\/span><b>selenium-stealth<\/b><span style=\"font-weight: 400;\"> library to automate a browser and fetch the fully loaded content.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Here\u2019s a simple example of fetching and printing the page content:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">from selenium import webdriver\r\nfrom selenium_stealth import stealth\r\nimport time\r\nimport re\r\n\r\noptions = webdriver.ChromeOptions()\r\noptions.add_argument(\"start-maximized\")\r\n\r\n# options.add_argument(\"--headless\")\r\n\r\noptions.add_experimental_option(\"excludeSwitches\", [\"enable-automation\"])\r\noptions.add_experimental_option('useAutomationExtension', False)\r\ndriver = webdriver.Chrome(options=options)\r\n\r\n\r\n# Stealth setup to avoid detection\r\nstealth(driver,\r\n        languages=[\"en-US\", \"en\"],\r\n        vendor=\"Google Inc.\",\r\n        platform=\"Win32\",\r\n        webgl_vendor=\"Intel Inc.\",\r\n        renderer=\"Intel Iris OpenGL Engine\",\r\n        fix_hairline=True,\r\n        )\r\n\r\n\r\n# Navigate to the listing page\r\nurl = 
\"https:\/\/www.airbnb.com\/s\/United-States\/homes?tab_id=home_tab&amp;refinement_paths%5B%5D=%2Fhomes&amp;flexible_trip_lengths%5B%5D=one_week&amp;monthly_start_date=2024-11-01&amp;monthly_length=3&amp;monthly_end_date=2025-02-01&amp;price_filter_input_type=0&amp;channel=EXPLORE&amp;query=United%20States&amp;place_id=ChIJCzYy5IS16lQRQrfeQ5K5Oxw&amp;date_picker_type=calendar&amp;source=structured_search_input_header&amp;search_type=user_map_move&amp;search_mode=regular_search&amp;price_filter_num_nights=5&amp;ne_lat=78.7534545389953&amp;ne_lng=17.82560738379206&amp;sw_lat=-36.13028852123955&amp;sw_lng=-124.379810004604&amp;zoom=2.613816079556603&amp;zoom_level=2.613816079556603&amp;search_by_map=true\"\r\n\r\ndriver.get(url)\r\n\r\n\r\n# Fetch and print the page source\r\nhtml_content = driver.page_source\r\nprint(html_content)\r\ndriver.quit()<\/pre>\n<p><strong>Code explanation:<\/strong><\/p>\n<ol>\n<li><strong>Chrome Setup<\/strong>: Configures Chrome with options to avoid detection (e.g., starting maximized, hiding automation flags).<\/li>\n<li><strong>Stealth Mode<\/strong>: Uses <code>selenium_stealth<\/code> to mimic a human user (adjusts language, platform, renderer).<\/li>\n<li><strong>Navigate &amp; Capture<\/strong>: Opens the specified Airbnb URL, captures the HTML content with <code>driver.page_source<\/code>.<\/li>\n<li><strong>Close Browser<\/strong>: Ends the session with <code>driver.quit()<\/code><\/li>\n<\/ol>\n<h1 id=\"Using Regex\"><b>\u00a0<\/b><b>Extracting Key Information from HTML Using Regex<\/b><\/h1>\n<p><span style=\"font-weight: 400;\">Once we have the HTML content using <\/span><b>selenium-stealth<\/b><span style=\"font-weight: 400;\">, the next step is to pass the HTML to <\/span><b>regex<\/b><span style=\"font-weight: 400;\"> for extracting important information such as details page URLs, prices, and other key details. 
By using regex, we can efficiently target specific patterns within the HTML without needing to parse the entire document structure.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Here\u2019s a simple example of how to extract key information like the details page URL:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">import re\r\n\r\n# Define a regex pattern to capture all property URLs from listing pages\r\nurl_pattern = 'labelledby=\"[^\"]+\" href=\"(\/rooms\/\\d+[^\"]+)\"'\r\n\r\n# Find all matching URLs in the HTML content\r\nurls = re.findall(url_pattern, html_content)\r\nprint(len(urls))\r\n\r\n \r\nurl_list = []  # Storing all URLs in a Python list\r\n\r\nfor url in urls:\r\n    details_page_url = \"https:\/\/www.airbnb.com\" + url\r\n    print(details_page_url)  # Print extracted URLs\r\n    url_list.append(details_page_url)<\/pre>\n<p>This regex pattern captures Airbnb property URLs from an HTML content string.<\/p>\n<h3>Code Explanation<\/h3>\n<ul>\n<li><code>url_pattern<\/code>: Matches anchor tags whose <code>href<\/code> starts with <code>\/rooms\/<\/code> followed by a numeric listing ID.<\/li>\n<li><code>urls = re.findall(url_pattern, html_content)<\/code>: Finds all instances of URLs that match <code>url_pattern<\/code> in <code>html_content<\/code>. Each match is added to the <code>urls<\/code> list.<\/li>\n<li>The <code>for<\/code> loop:\n<ul>\n<li>Iterates through each matched <code>url<\/code> in <code>urls<\/code>.<\/li>\n<li>Prepends each URL with the base <code>https:\/\/www.airbnb.com<\/code>, forming a complete URL.<\/li>\n<li>Prints each URL and appends it to <code>url_list<\/code> for later use.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h1 id=\"Handling Pagination\"><b>Handling Pagination: Navigating Through Multiple Pages of Listings Efficiently<\/b><\/h1>\n<p><span style=\"font-weight: 400;\">To make scraping more flexible, you can allow the user to input how many pages they want to scrape. This ensures the scraper clicks through the exact number of pages requested and stops automatically. 
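Before wiring up button clicks, note the alternative mentioned earlier: when pagination is exposed in the query string, you can build the page URLs up front instead of clicking through. A minimal sketch, assuming a hypothetical page-style parameter (Airbnb's real search URLs use offset-style cursor parameters, so inspect the actual URL and substitute the real names):

```python
from urllib.parse import urlencode

# Hypothetical base URL and "page" parameter, for illustration only --
# check the real search URL in your browser to find the actual names.
BASE_URL = "https://www.airbnb.com/s/United-States/homes"

def build_page_urls(num_pages):
    """Build one search-results URL per page by varying a query parameter."""
    urls = []
    for page in range(1, num_pages + 1):
        query = urlencode({"refinement_paths[]": "/homes", "page": page})
        urls.append(f"{BASE_URL}?{query}")
    return urls

for page_url in build_page_urls(3):
    print(page_url)
```

Fetching known URLs in a batch like this avoids the fragility of locating and clicking a "Next" button, but only works when the site encodes the page in the URL.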
Here\u2019s how you can modify the pagination logic to accept user input for the number of pages:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">from selenium import webdriver\r\nfrom selenium_stealth import stealth\r\nfrom selenium.webdriver.common.by import By\r\nfrom selenium.webdriver.support.ui import WebDriverWait\r\nfrom selenium.webdriver.support import expected_conditions as EC\r\nimport time\r\nimport re\r\nimport pandas as pd\r\n\r\noptions = webdriver.ChromeOptions()\r\noptions.add_argument(\"start-maximized\")\r\n# options.add_argument(\"--headless\")\r\noptions.add_experimental_option(\"excludeSwitches\", [\"enable-automation\"])\r\noptions.add_experimental_option('useAutomationExtension', False)\r\ndriver = webdriver.Chrome(options=options)\r\n\r\n# Stealth setup to avoid detection\r\nstealth(driver,\r\n        languages=[\"en-US\", \"en\"],\r\n        vendor=\"Google Inc.\",\r\n        platform=\"Win32\",\r\n        webgl_vendor=\"Intel Inc.\",\r\n        renderer=\"Intel Iris OpenGL Engine\",\r\n        fix_hairline=True,\r\n        )\r\n\r\n# Function to scrape the current page and return all property URLs\r\ndef scrape_current_page():\r\n    html_content = driver.page_source\r\n    url_pattern = 'labelledby=\"[^\"]+\" href=\"(\/rooms\/\\d+[^\"]+)\"'\r\n    urls = re.findall(url_pattern, html_content)\r\n    return urls\r\n\r\n# Function to scroll to the bottom of the page\r\ndef scroll_to_bottom():\r\n    driver.execute_script(\"window.scrollTo(0, document.body.scrollHeight);\")\r\n    time.sleep(2)  # Give time for the page to load additional content\r\n\r\n# Function to wait for the \"Next\" button and click it\r\ndef go_to_next_page():\r\n    try:\r\n        # Wait until the \"Next\" button is clickable\r\n        next_button = WebDriverWait(driver, 10).until(\r\n            EC.element_to_be_clickable((By.CSS_SELECTOR, \"a[aria-label='Next']\"))\r\n        )\r\n        scroll_to_bottom()  # Scroll to the bottom of 
the page before clicking\r\n        next_button.click()\r\n        return True\r\n    except Exception as e:\r\n        print(f\"Couldn't navigate to next page: {e}\")\r\n        return False\r\n\r\n# base url\r\nurl = \"https:\/\/www.airbnb.com\/s\/United-States\/homes?flexible_trip_lengths%5B%5D=one_week&amp;date_picker_type=flexible_dates&amp;place_id=ChIJCzYy5IS16lQRQrfeQ5K5Oxw&amp;refinement_paths%5B%5D=%2Fhomes&amp;search_type=AUTOSUGGEST\"\r\ndriver.get(url)\r\n\r\n# Ask the user how many pages to scrape\r\nnum_pages = int(input(\"How many pages do you want to scrape? \"))\r\n\r\nurl_list = []  # Storing all URLs in a Python list\r\n\r\n# Scrape the specified number of pages\r\nfor page in range(num_pages):\r\n    print(f\"Scraping page {page + 1}...\")\r\n   \r\n    # Scrape URLs from the current page\r\n    urls = scrape_current_page()\r\n    for url in urls:\r\n        details_page_url = \"https:\/\/www.airbnb.com\" + url\r\n        print(details_page_url)  # Print extracted URLs\r\n        url_list.append(details_page_url)\r\n   \r\n    # Try to go to the next page\r\n    if not go_to_next_page():\r\n        break  # If there's no \"Next\" button or an error occurs, stop the loop\r\n   \r\n    # Wait for the next page to load\r\n    time.sleep(3)\r\n\r\n# After scraping is complete, print the total number of URLs\r\nprint(f\"Total URLs scraped: {len(url_list)}\")<\/pre>\n<p><strong>code explanation:<\/strong><\/p>\n<ol>\n<li><strong>Page Navigation Loop<\/strong>: The code iterates through multiple pages based on the <code>num_pages<\/code> input, scraping each page&#8217;s URLs.<\/li>\n<li><strong>Scrape Current Page<\/strong>: <code>scrape_current_page()<\/code> extracts property URLs from the HTML of the current page using regex.<\/li>\n<li><strong>Scroll to Bottom<\/strong>: <code>scroll_to_bottom()<\/code> scrolls to the bottom of the page, ensuring any lazy-loaded content is loaded.<\/li>\n<li><strong>Next Page Button<\/strong>: 
<code>go_to_next_page()<\/code> waits for the &#8220;Next&#8221; button to appear and scrolls to the bottom before clicking it. If the button is clickable, it moves to the next page; if not, it stops the loop.<\/li>\n<li><strong>Repeat<\/strong>: This process repeats for each page until the specified <code>num_pages<\/code> is reached or no &#8220;Next&#8221; button is found.<\/li>\n<\/ol>\n<p>This pagination approach allows the code to move seamlessly through multiple pages, scraping each page&#8217;s data until reaching the end.<\/p>\n<h1 id=\"Scrape Detail Page\"><b>Handling Dynamic Content: Using Selenium Stealth to Scrape Detail Pages Loaded with JavaScript<\/b><\/h1>\n<p><span style=\"font-weight: 400;\">After gathering the list of detail page URLs, the next step is to loop through each URL and extract the required information. Since these detail pages load dynamic content via JavaScript, <\/span><b>Selenium Stealth<\/b><span style=\"font-weight: 400;\"> ensures the page is fully loaded before we extract the data using <\/span><b>regex<\/b><span style=\"font-weight: 400;\"> to find and pull specific information from the HTML.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Here\u2019s how you can handle this:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\"><b>Loop Through URLs<\/b><span style=\"font-weight: 400;\">: Iterate through each URL from your list.<\/span><\/li>\n<li style=\"font-weight: 400;\"><b>Load the Page<\/b><span style=\"font-weight: 400;\">: Use Selenium Stealth to fully load each detail page.<\/span><\/li>\n<li style=\"font-weight: 400;\"><b>Extract Data with Regex<\/b><span style=\"font-weight: 400;\">: Use regex to extract specific data such as pricing, reviews, or descriptions.<\/span><\/li>\n<\/ol>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\"># function to scrape information from a details page (title, price, etc.)\r\ndef scrape_details_page(url):\r\n    try:\r\n        driver.get(url)\r\n        # Wait for 
 the page to load (you can adjust this)\r\n        time.sleep(2)\r\n        html_content = driver.page_source\r\n        scroll_to_bottom()\r\n        time.sleep(2)\r\n        # Regex pattern for scraping the title\r\n        title_pattern = r'&lt;h1[^&gt;]+&gt;([^&lt;]+)&lt;\/h1&gt;'\r\n   \r\n        # Scrape the title (adjust the selector according to the page structure)\r\n        title = re.search(title_pattern,html_content)\r\n        if title:\r\n           title = title.group(1)\r\n        else:\r\n            title = None\r\n       \r\n        price_pattern = r'(\\$\\d+[^&lt;]+)&lt;\/span&gt;&lt;\/span&gt;[^&gt;]+&gt;&lt;\/div&gt;&lt;\/div&gt;'\r\n        price = re.search(price_pattern,html_content)\r\n   \r\n        if price:\r\n            price = price.group(1)\r\n        else:\r\n            price = None\r\n\r\n        address_pattern = r'dir-ltr\"&gt;&lt;div[^&gt;]+&gt;&lt;section&gt;&lt;div[^&gt;]+ltr\"&gt;&lt;h2[^&gt;]+&gt;([^&lt;]+)&lt;\/h2&gt;'\r\n        address =  re.search(address_pattern,html_content)\r\n        if address:\r\n           address =  address.group(1)\r\n        else:\r\n            address = None\r\n       \r\n        guest_pattern = r'&lt;li class=\"l7n4lsf[^&gt;]+&gt;([^&lt;]+)&lt;span'\r\n        guest =   re.search(guest_pattern,html_content)\r\n        if guest:\r\n           guest = guest.group(1)\r\n        else:\r\n            guest = None\r\n        # You can add more information to scrape (example: price, description, etc.)\r\n       \r\n        bed_bath_pattern = r'&lt;\/span&gt;(\\d+[^&lt;]+)'\r\n        bed_bath = re.findall(bed_bath_pattern,html_content)\r\n        bed_bath_details = []\r\n        if bed_bath:\r\n            for bed_bath_info in bed_bath:\r\n                bed_bath_details.append(bed_bath_info.strip())\r\n       \r\n        reviews_pattern = r'l1nqfsv9[^&gt;]+&gt;([^&lt;]+)&lt;\/div&gt;[^&gt;]+&gt;(\\d+[^&lt;]+)&lt;\/div&gt;'\r\n        reviews_details =  re.findall(reviews_pattern,html_content)\r\n       
 review_list = []\r\n        if reviews_details:\r\n               for review in reviews_details:\r\n                    attribute, rating = review  # Unpack the attribute and rating\r\n                    review_list.append(f'{attribute} {rating}')  # Combine into a readable format\r\n\r\n\r\n        host_name_pattern = r't1gpcl1t atm_w4_16rzvi6 atm_9s_1o8liyq atm_gi_idpfg4 dir dir-ltr[^&gt;]+&gt;([^&lt;]+)'\r\n        host_name =  re.search(host_name_pattern,html_content)\r\n        if host_name:\r\n           host_name = host_name.group(1)\r\n        else:\r\n            host_name = None\r\n\r\n        total_review_pattern = r'pdp-reviews-[^&gt;]+&gt;[^&gt;]+&gt;(\\d+[^&lt;]+)&lt;\/span&gt;'\r\n        total_review =  re.search(total_review_pattern,html_content)\r\n        if total_review:\r\n           total_review =  total_review.group(1)\r\n        else:\r\n            total_review = None\r\n\r\n\r\n        host_info_pattern = r'd1u64sg5[^\"]+atm_67_1vlbu9m dir dir-ltr[^&gt;]+&gt;&lt;div&gt;&lt;span[^&gt;]+&gt;([^&lt;]+)'\r\n        host_info = re.findall(host_info_pattern,html_content)\r\n        host_info_list = []\r\n        if host_info:\r\n            for host_info_details in host_info:\r\n                 host_info_list.append(host_info_details)\r\n       \r\n        # Print the scraped information (for debugging purposes)\r\n        print(f\"Title: {title}\\n Price: {price}\\n Address: {address}\\n Guest: {guest}\\n bed_bath_details: {bed_bath_details}\\n Reviews: {review_list}\\n Host_name: {host_name}\\n total_review: {total_review}\\n Host Info: {host_info_list}\\n\")\r\n       \r\n        # Store the scraped information in a dictionary\r\n        return {\r\n            \"url\": url,\r\n            \"Title\": title,\r\n            \"Price\": price,\r\n            \"Address\": address,\r\n            \"Guest\": guest,\r\n            \"Bed_Bath_Details\": bed_bath_details,\r\n   
          \"Reviews\": review_list,\r\n            \"Host_Name\": host_name,\r\n            \"Total_Reviews\": total_review,\r\n            \"Host_Info\": host_info_list\r\n        }\r\n    except Exception as e:\r\n        print(f\"Error scraping {url}: {e}\")\r\n        return None\r\n\r\n\r\n# Scrape the details page for each URL stored in the url_list  \r\nfor url in url_list:\r\n    print(f\"Scraping details from: {url}\")\r\n    data = scrape_details_page(url)<\/pre>\n<p><span style=\"font-weight: 400;\">In this approach, we load each detail page using <\/span><b>Selenium Stealth<\/b><span style=\"font-weight: 400;\"> to ensure dynamic JavaScript content is fully loaded, and then use <\/span><b>regex<\/b><span style=\"font-weight: 400;\"> to extract specific data directly from the HTML content. This method is efficient and flexible for scraping pages with dynamically loaded content.<\/span><\/p>\n<h1 id=\"Saving data to csv\"><b>Saving the Data: Storing Scraped Data Using Pandas in CSV\/Excel Formats<\/b><\/h1>\n<p><span style=\"font-weight: 400;\">After scraping and extracting data from the detail pages, it&#8217;s time to store that data effectively. 
Pandas makes it easy to save the extracted information in popular formats like CSV or Excel, which are widely used for data analysis and sharing.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This is how you can do it:<\/span><\/p>\n<p><b>Step 1: <\/b><span style=\"font-weight: 400;\">Create a DataFrame. After fetching the data, place it in a pandas DataFrame.<\/span><\/p>\n<p><b>Step 2: <\/b><span style=\"font-weight: 400;\">Save to CSV\/Excel. Save the DataFrame as a CSV or Excel file with pandas functions.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Example:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">import pandas as pd\r\n\r\n# Function to save data to CSV using pandas\r\ndef save_to_csv(data, filename='airbnb_data.csv'):\r\n    df = pd.DataFrame(data)\r\n    df.to_csv(filename, index=False)\r\n    print(f\"Data saved to {filename}\")\r\n\r\n\r\nscraped_data = []\r\n\r\n\r\n# Scrape the details page for each URL stored in the url_list  \r\nfor url in url_list:\r\n    print(f\"Scraping details from: {url}\")\r\n    data = scrape_details_page(url)\r\n    if data:\r\n        scraped_data.append(data)\r\n\r\n\r\n# After scraping, save data to CSV\r\nif scraped_data:\r\n    save_to_csv(scraped_data)\r\nelse:\r\n    print(\"No data to save.\")<\/pre>\n<p><span style=\"font-weight: 400;\">By using <\/span><b>Pandas<\/b><span style=\"font-weight: 400;\">, you can easily manage and store the scraped data in structured formats, making it simple to use for further analysis or sharing with others. 
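As a quick sanity check, you can load the saved file back into pandas and analyze it. A minimal sketch (the two sample rows below are made up for illustration; real output will have the full set of scraped columns):

```python
import pandas as pd

# Stand-in rows mimicking the dictionaries returned by the scraper
rows = [
    {"url": "https://www.airbnb.com/rooms/1", "Title": "Cozy loft", "Price": "$120"},
    {"url": "https://www.airbnb.com/rooms/2", "Title": "Beach house", "Price": "$250"},
]
pd.DataFrame(rows).to_csv("airbnb_data.csv", index=False)

# Reload the CSV and convert "$120"-style strings to numbers for analysis
loaded = pd.read_csv("airbnb_data.csv")
loaded["Price"] = loaded["Price"].str.lstrip("$").astype(float)
print(loaded["Price"].mean())  # average nightly price across listings
```

Cleaning the price column on reload (stripping the currency symbol, casting to float) is what makes aggregations like averages per city possible later.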
Here is the screenshot of csv result:<\/span><\/p>\n<p><span style=\"font-weight: 400;\"> <img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1290 size-full\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/air_bnb.png\" alt=\"\" width=\"1917\" height=\"1038\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/air_bnb.png 1917w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/air_bnb-300x162.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/air_bnb-1024x554.png 1024w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/air_bnb-768x416.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/air_bnb-1536x832.png 1536w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/10\/air_bnb-624x338.png 624w\" sizes=\"auto, (max-width: 1917px) 100vw, 1917px\" \/><\/span><\/p>\n<p><strong>here is full code:<\/strong><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">from selenium import webdriver\r\nfrom selenium_stealth import stealth\r\nfrom selenium.webdriver.common.by import By\r\nfrom selenium.webdriver.support.ui import WebDriverWait\r\nfrom selenium.webdriver.support import expected_conditions as EC\r\nimport time\r\nimport re\r\nimport pandas as pd\r\n\r\noptions = webdriver.ChromeOptions()\r\noptions.add_argument(\"start-maximized\")\r\n# options.add_argument(\"--headless\")\r\noptions.add_experimental_option(\"excludeSwitches\", [\"enable-automation\"])\r\noptions.add_experimental_option('useAutomationExtension', False)\r\ndriver = webdriver.Chrome(options=options)\r\n\r\n# Stealth setup to avoid detection\r\nstealth(driver,\r\n        languages=[\"en-US\", \"en\"],\r\n        vendor=\"Google Inc.\",\r\n        platform=\"Win32\",\r\n        webgl_vendor=\"Intel Inc.\",\r\n        renderer=\"Intel Iris OpenGL Engine\",\r\n        fix_hairline=True,\r\n        )\r\n\r\n# Function to 
scrape the current page and return all property URLs\r\ndef scrape_current_page():\r\n    html_content = driver.page_source\r\n    url_pattern = 'labelledby=\"[^\"]+\" href=\"(\/rooms\/\\d+[^\"]+)\"'\r\n    urls = re.findall(url_pattern, html_content)\r\n    return urls\r\n\r\n# Function to scroll to the bottom of the page\r\ndef scroll_to_bottom():\r\n    driver.execute_script(\"window.scrollTo(0, document.body.scrollHeight);\")\r\n    time.sleep(2)  # Give time for the page to load additional content\r\n\r\n# Function to wait for the \"Next\" button and click it\r\ndef go_to_next_page():\r\n    try:\r\n        # Wait until the \"Next\" button is clickable\r\n        next_button = WebDriverWait(driver, 10).until(\r\n            EC.element_to_be_clickable((By.CSS_SELECTOR, \"a[aria-label='Next']\"))\r\n        )\r\n        scroll_to_bottom()  # Scroll to the bottom of the page before clicking\r\n        next_button.click()\r\n        return True\r\n    except Exception as e:\r\n        print(f\"Couldn't navigate to next page: {e}\")\r\n        return False\r\n\r\n# base url\r\nurl = \"https:\/\/www.airbnb.com\/s\/United-States\/homes?flexible_trip_lengths%5B%5D=one_week&amp;date_picker_type=flexible_dates&amp;place_id=ChIJCzYy5IS16lQRQrfeQ5K5Oxw&amp;refinement_paths%5B%5D=%2Fhomes&amp;search_type=AUTOSUGGEST\"\r\ndriver.get(url)\r\n\r\n# Ask the user how many pages to scrape\r\nnum_pages = int(input(\"How many pages do you want to scrape? 
\"))\r\n\r\nurl_list = []  # Storing all URLs in a Python list\r\n\r\n# Scrape the specified number of pages\r\nfor page in range(num_pages):\r\n    print(f\"Scraping page {page + 1}...\")\r\n    \r\n    # Scrape URLs from the current page\r\n    urls = scrape_current_page()\r\n    for url in urls:\r\n        details_page_url = \"https:\/\/www.airbnb.com\" + url\r\n        print(details_page_url)  # Print extracted URLs\r\n        url_list.append(details_page_url)\r\n    \r\n    # Try to go to the next page\r\n    if not go_to_next_page():\r\n        break  # If there's no \"Next\" button or an error occurs, stop the loop\r\n    \r\n    # Wait for the next page to load\r\n    time.sleep(3)\r\n\r\n# After scraping is complete, print the total number of URLs\r\nprint(f\"Total URLs scraped: {len(url_list)}\")\r\n\r\n\r\n# Function to scrape information from a details page (title, price, etc.)\r\ndef scrape_details_page(url):\r\n    try:\r\n        driver.get(url)\r\n        # Wait for the page to load (you can adjust this)\r\n        time.sleep(2)\r\n        html_content = driver.page_source\r\n        scroll_to_bottom()\r\n        time.sleep(2) \r\n        # Regex pattern for scraping the title\r\n        title_pattern = r'&lt;h1[^&gt;]+&gt;([^&lt;]+)&lt;\/h1&gt;'\r\n    \r\n        # Scrape the title (adjust the selector according to the page structure)\r\n        title = re.search(title_pattern,html_content)\r\n        if title:\r\n           title = title.group(1)\r\n        else:\r\n            title = None\r\n        \r\n        price_pattern = r'(\\$\\d+[^&lt;]+)&lt;\/span&gt;&lt;\/span&gt;[^&gt;]+&gt;&lt;\/div&gt;&lt;\/div&gt;'\r\n        price = re.search(price_pattern,html_content)\r\n    \r\n        if price:\r\n            price = price.group(1)\r\n        else:\r\n            price = None\r\n\r\n        address_pattern = r'dir-ltr\"&gt;&lt;div[^&gt;]+&gt;&lt;section&gt;&lt;div[^&gt;]+ltr\"&gt;&lt;h2[^&gt;]+&gt;([^&lt;]+)&lt;\/h2&gt;'\r\n        
address = re.search(address_pattern, html_content)\r\n        if address:\r\n            address = address.group(1)\r\n        else:\r\n            address = None\r\n\r\n        guest_pattern = r'&lt;li class=\"l7n4lsf[^&gt;]+&gt;([^&lt;]+)&lt;span'\r\n        guest = re.search(guest_pattern, html_content)\r\n        if guest:\r\n            guest = guest.group(1)\r\n        else:\r\n            guest = None\r\n        # You can add more information to scrape (example: description, amenities, etc.)\r\n\r\n        bed_bath_pattern = r'&lt;\/span&gt;(\\d+[^&lt;]+)'\r\n        bed_bath = re.findall(bed_bath_pattern, html_content)\r\n        bed_bath_details = []\r\n        if bed_bath:\r\n            for bed_bath_info in bed_bath:\r\n                bed_bath_details.append(bed_bath_info.strip())\r\n\r\n        reviews_pattern = r'l1nqfsv9[^&gt;]+&gt;([^&lt;]+)&lt;\/div&gt;[^&gt;]+&gt;(\\d+[^&lt;]+)&lt;\/div&gt;'\r\n        reviews_details = re.findall(reviews_pattern, html_content)\r\n        review_list = []\r\n        if reviews_details:\r\n            for review in reviews_details:\r\n                attribute, rating = review  # Unpack the attribute and rating\r\n                review_list.append(f'{attribute} {rating}')  # Combine into a readable format\r\n\r\n        host_name_pattern = r't1gpcl1t atm_w4_16rzvi6 atm_9s_1o8liyq atm_gi_idpfg4 dir dir-ltr[^&gt;]+&gt;([^&lt;]+)'\r\n        host_name = re.search(host_name_pattern, html_content)\r\n        if host_name:\r\n            host_name = host_name.group(1)\r\n        else:\r\n            host_name = None\r\n\r\n        total_review_pattern = r'pdp-reviews-[^&gt;]+&gt;[^&gt;]+&gt;(\\d+[^&lt;]+)&lt;\/span&gt;'\r\n        total_review = re.search(total_review_pattern, html_content)\r\n        if total_review:\r\n            total_review = total_review.group(1)\r\n        else:\r\n            total_review = None\r\n\r\n        host_info_pattern = 
r'd1u64sg5[^\"]+atm_67_1vlbu9m dir dir-ltr[^&gt;]+&gt;&lt;div&gt;&lt;span[^&gt;]+&gt;([^&lt;]+)'\r\n        host_info = re.findall(host_info_pattern, html_content)\r\n        host_info_list = []\r\n        if host_info:\r\n            for host_info_details in host_info:\r\n                host_info_list.append(host_info_details)\r\n\r\n        # Print the scraped information (for debugging purposes)\r\n        print(f\"Title: {title}\\n Price: {price}\\n Address: {address}\\n Guest: {guest}\\n Bed_bath_details: {bed_bath_details}\\n Reviews: {review_list}\\n Host_name: {host_name}\\n Total_review: {total_review}\\n Host Info: {host_info_list}\\n\")\r\n\r\n        # Store the scraped information in a dictionary (adjust based on your needs)\r\n        return {\r\n            \"url\": url,\r\n            \"Title\": title,\r\n            \"Price\": price,\r\n            \"Address\": address,\r\n            \"Guest\": guest,\r\n            \"Bed_Bath_Details\": bed_bath_details,\r\n            \"Reviews\": review_list,\r\n            \"Host_Name\": host_name,\r\n            \"Total_Reviews\": total_review,\r\n            \"Host_Info\": host_info_list\r\n        }\r\n    except Exception as e:\r\n        print(f\"Error scraping {url}: {e}\")\r\n        return None\r\n\r\n\r\n# Function to save data to CSV using pandas\r\ndef save_to_csv(data, filename='airbnb_data.csv'):\r\n    df = pd.DataFrame(data)\r\n    df.to_csv(filename, index=False)\r\n    print(f\"Data saved to {filename}\")\r\n\r\n\r\nscraped_data = []\r\n\r\n# Scrape the details page for each URL stored in the url_list\r\nfor url in url_list:\r\n    print(f\"Scraping details from: {url}\")\r\n    data = scrape_details_page(url)\r\n    if data:\r\n        scraped_data.append(data)\r\n\r\n# After scraping, save data to CSV\r\nif scraped_data:\r\n    save_to_csv(scraped_data)\r\nelse:\r\n    print(\"No data to save.\")\r\n\r\n# Close the 
browser\r\ndriver.quit()\r\n<\/pre>\n<h1 id=\"Techniques to Prevent Getting Blocked\"><b>Avoiding Blocks: Techniques to Prevent Getting Blocked<\/b><\/h1>\n<p><span style=\"font-weight: 400;\">High-traffic sites such as Airbnb are quick to block IP addresses that send too many requests or to flag them as bots. Proxies help you avoid this: they hide your true IP address, so the requests appear to come from different places.<\/span><\/p>\n<p><b>Why Use Proxies?<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Without a proxy, you can be rate-limited or blocked outright for sending too many requests from the same IP address. Routing your traffic through a proxy hides your IP and avoids this.<\/span><\/p>\n<p><b>Why Use Rotating Proxies?<\/b><\/p>\n<p><b>Preventing IP bans:<\/b><span style=\"font-weight: 400;\"> Rotating proxies change the IP address with each request (or every few requests), making it much less likely that your scraper is detected and banned.<\/span><\/p>\n<p><b>Bypassing captchas: <\/b><span style=\"font-weight: 400;\">When a website sees a lot of traffic coming from one IP address, it often starts serving captchas. Rotating proxies spread requests across numerous IPs, helping to avoid captchas.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This tutorial shows an example of rotating proxies with Rayobyte, but you can use any other reliable proxy provider that supports rotation. 
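<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The rotation step itself is small. Here is a minimal, Selenium-free sketch of the idea, assuming a hypothetical proxy_pool list shaped like the one used in the full example below (the hosts and credentials are placeholders, not real endpoints): pick a random entry before each request so that consecutive requests leave from different IP addresses.<\/span><\/p>

```python
import random

# Placeholder proxy pool (hypothetical hosts and credentials, not real endpoints)
proxy_pool = [
    {"proxy": "proxy1.com:8000", "username": "user1", "password": "pass1"},
    {"proxy": "proxy2.com:8000", "username": "user2", "password": "pass2"},
    {"proxy": "proxy3.com:8000", "username": "user3", "password": "pass3"},
]

def pick_proxy(pool):
    """Return a randomly chosen proxy entry, so successive requests use different IPs."""
    return random.choice(pool)

# Each request would be issued through a freshly picked proxy
for _ in range(3):
    proxy = pick_proxy(proxy_pool)
    print(proxy["proxy"])
```

<p><span style=\"font-weight: 400;\">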
Rotating proxies reduce the likelihood of captchas and IP bans, allowing you to scrape faster for longer periods.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Example of using a proxy:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import pandas as pd\r\nimport re\r\nimport time\r\nimport random\r\nfrom selenium import webdriver\r\nfrom selenium_stealth import stealth\r\nfrom selenium.webdriver.common.by import By\r\nfrom selenium.webdriver.support.ui import WebDriverWait\r\nfrom selenium.webdriver.support import expected_conditions as EC\r\n\r\n\r\n# Function to create a Chrome extension that handles proxy authentication\r\ndef create_proxy_auth_extension(proxy_host, proxy_user, proxy_pass):\r\n    import zipfile\r\n\r\n    # Separate the host and port\r\n    host = proxy_host.split(':')[0]\r\n    port = proxy_host.split(':')[1]\r\n\r\n    # Define proxy extension files\r\n    manifest_json = \"\"\"\r\n    {\r\n        \"version\": \"1.0.0\",\r\n        \"manifest_version\": 2,\r\n        \"name\": \"Chrome Proxy\",\r\n        \"permissions\": [\r\n            \"proxy\",\r\n            \"tabs\",\r\n            \"unlimitedStorage\",\r\n            \"storage\",\r\n            \"&lt;all_urls&gt;\",\r\n            \"webRequest\",\r\n            \"webRequestBlocking\"\r\n        ],\r\n        \"background\": {\r\n            \"scripts\": [\"background.js\"]\r\n        },\r\n        \"minimum_chrome_version\":\"22.0.0\"\r\n    }\r\n    \"\"\"\r\n\r\n    background_js = f\"\"\"\r\n    var config = {{\r\n            mode: \"fixed_servers\",\r\n            rules: {{\r\n              singleProxy: {{\r\n                scheme: \"http\",\r\n                host: \"{host}\",\r\n                port: parseInt({port})\r\n              }},\r\n              bypassList: [\"localhost\"]\r\n            }}\r\n          }};\r\n    chrome.proxy.settings.set({{value: config, scope: \"regular\"}}, function() {{}});\r\n\r\n    
chrome.webRequest.onAuthRequired.addListener(\r\n        function(details) {{\r\n            return {{\r\n                authCredentials: {{\r\n                    username: \"{proxy_user}\",\r\n                    password: \"{proxy_pass}\"\r\n                }}\r\n            }};\r\n        }},\r\n        {{urls: [\"&lt;all_urls&gt;\"]}},\r\n        [\"blocking\"]\r\n    );\r\n    \"\"\"\r\n\r\n    # Create the extension\r\n    pluginfile = 'proxy_auth_plugin.zip'\r\n    with zipfile.ZipFile(pluginfile, 'w') as zp:\r\n        zp.writestr(\"manifest.json\", manifest_json)\r\n        zp.writestr(\"background.js\", background_js)\r\n\r\n    return pluginfile\r\n\r\n\r\n# Function to configure and return the WebDriver with proxy\r\ndef init_driver_with_proxy(proxy_server, proxy_username, proxy_password):\r\n    options = webdriver.ChromeOptions()\r\n    options.add_argument(\"start-maximized\")\r\n\r\n    # Add proxy authentication if necessary\r\n    if proxy_username and proxy_password:\r\n        options.add_extension(create_proxy_auth_extension(proxy_server, proxy_username, proxy_password))\r\n\r\n    # Stealth mode to avoid detection\r\n    driver = webdriver.Chrome(options=options)\r\n    stealth(driver,\r\n            languages=[\"en-US\", \"en\"],\r\n            vendor=\"Google Inc.\",\r\n            platform=\"Win32\",\r\n            webgl_vendor=\"Intel Inc.\",\r\n            renderer=\"Intel Iris OpenGL Engine\",\r\n            fix_hairline=True,\r\n            )\r\n    return driver\r\n\r\n\r\n# Proxy pool for rotation (list of proxy servers)\r\nproxy_pool = [\r\n    {\"proxy\": \"proxy1.com:8000\", \"username\": \"user1\", \"password\": \"pass1\"},\r\n    {\"proxy\": \"proxy2.com:8000\", \"username\": \"user2\", \"password\": \"pass2\"},\r\n    {\"proxy\": \"proxy3.com:8000\", \"username\": \"user3\", \"password\": \"pass3\"}\r\n   \r\n]\r\n\r\n# Function to scrape details page (rotate proxy on each request)\r\ndef scrape_details_page(url):\r\n    
try:\r\n        # Rotate proxy by choosing a random one from the pool\r\n        proxy = random.choice(proxy_pool)\r\n        driver = init_driver_with_proxy(proxy['proxy'], proxy['username'], proxy['password'])\r\n\r\n        driver.get(url)\r\n        time.sleep(3)  # Wait for the page to load\r\n\r\n        html_content = driver.page_source\r\n\r\n        # Regex pattern for scraping the title\r\n        title_pattern = r'&lt;h1[^&gt;]+&gt;([^&lt;]+)&lt;\/h1&gt;'\r\n\r\n        # Scrape the title\r\n        title = re.search(title_pattern, html_content)\r\n        if title:\r\n            title = title.group(1)\r\n        else:\r\n            title = None\r\n\r\n        # Scrape the price\r\n        price_pattern = r'(\\$\\d+[^&lt;]+)&lt;\/span&gt;&lt;\/span&gt;[^&gt;]+&gt;&lt;\/div&gt;&lt;\/div&gt;'\r\n        price = re.search(price_pattern, html_content)\r\n\r\n        if price:\r\n            price = price.group(1)\r\n        else:\r\n            price = None\r\n\r\n        # Scrape the address\r\n        address_pattern = r'dir-ltr\"&gt;&lt;div[^&gt;]+&gt;&lt;section&gt;&lt;div[^&gt;]+ltr\"&gt;&lt;h2[^&gt;]+&gt;([^&lt;]+)&lt;\/h2&gt;'\r\n        address = re.search(address_pattern, html_content)\r\n        if address:\r\n            address = address.group(1)\r\n        else:\r\n            address = None\r\n\r\n        # Scrape the guest count\r\n        guest_pattern = r'&lt;li class=\"l7n4lsf[^&gt;]+&gt;([^&lt;]+)&lt;span'\r\n        guest = re.search(guest_pattern, html_content)\r\n        if guest:\r\n            guest = guest.group(1)\r\n        else:\r\n            guest = None\r\n        # You can add more information to scrape (example: description, amenities, etc.)\r\n\r\n        # Scrape the bedrooms, beds, and baths details\r\n        bed_bath_pattern = r'&lt;\/span&gt;(\\d+[^&lt;]+)'\r\n        bed_bath = re.findall(bed_bath_pattern, html_content)\r\n        bed_bath_details = []\r\n        if bed_bath:\r\n            for 
bed_bath_info in bed_bath:\r\n                bed_bath_details.append(bed_bath_info.strip())\r\n\r\n        # Scrape category reviews such as Cleanliness, Accuracy, Communication, etc.\r\n        reviews_pattern = r'l1nqfsv9[^&gt;]+&gt;([^&lt;]+)&lt;\/div&gt;[^&gt;]+&gt;(\\d+[^&lt;]+)&lt;\/div&gt;'\r\n        reviews_details = re.findall(reviews_pattern, html_content)\r\n        review_list = []\r\n        if reviews_details:\r\n            for review in reviews_details:\r\n                attribute, rating = review  # Unpack the attribute and rating\r\n                review_list.append(f'{attribute} {rating}')  # Combine into a readable format\r\n\r\n        # Scrape the host name\r\n        host_name_pattern = r't1gpcl1t atm_w4_16rzvi6 atm_9s_1o8liyq atm_gi_idpfg4 dir dir-ltr[^&gt;]+&gt;([^&lt;]+)'\r\n        host_name = re.search(host_name_pattern, html_content)\r\n        if host_name:\r\n            host_name = host_name.group(1)\r\n        else:\r\n            host_name = None\r\n\r\n        # Scrape the total number of reviews\r\n        total_review_pattern = r'pdp-reviews-[^&gt;]+&gt;[^&gt;]+&gt;(\\d+[^&lt;]+)&lt;\/span&gt;'\r\n        total_review = re.search(total_review_pattern, html_content)\r\n        if total_review:\r\n            total_review = total_review.group(1)\r\n        else:\r\n            total_review = None\r\n\r\n        # Scrape host info\r\n        host_info_pattern = r'd1u64sg5[^\"]+atm_67_1vlbu9m dir dir-ltr[^&gt;]+&gt;&lt;div&gt;&lt;span[^&gt;]+&gt;([^&lt;]+)'\r\n        host_info = re.findall(host_info_pattern, html_content)\r\n        host_info_list = []\r\n        if host_info:\r\n            for host_info_details in host_info:\r\n                host_info_list.append(host_info_details)\r\n\r\n        # Print the scraped information (for debugging purposes)\r\n        print(f\"Title: {title}\\n Price: {price}\\n Address: {address}\\n Guest: {guest}\\n Bed_bath_details: {bed_bath_details}\\n Reviews: {review_list}\\n Host_name: {host_name}\\n Total_review: {total_review}\\n Host Info: {host_info_list}\\n\")\r\n\r\n        # Store the scraped information in a dictionary (adjust based on your needs)\r\n        data = {\r\n            \"url\": url,\r\n            \"Title\": title,\r\n            \"Price\": price,\r\n            \"Address\": address,\r\n            \"Guest\": guest,\r\n            \"Bed_Bath_Details\": bed_bath_details,\r\n            \"Reviews\": review_list,\r\n            \"Host_Name\": host_name,\r\n            \"Total_Reviews\": total_review,\r\n            \"Host_Info\": host_info_list\r\n        }\r\n        driver.quit()  # Close this browser session before returning\r\n        return data\r\n    except Exception as e:\r\n        print(f\"Error scraping {url}: {e}\")\r\n        return None\r\n\r\n\r\n# Function to save data to CSV using pandas\r\ndef save_to_csv(data, filename='airbnb_data.csv'):\r\n    df = pd.DataFrame(data)\r\n    df.to_csv(filename, index=False)\r\n    print(f\"Data saved to {filename}\")\r\n\r\n\r\n# List of URLs to scrape\r\nurl_list = [\"https:\/\/www.airbnb.com\/rooms\/968367851365040114?adults=1&amp;category_tag=Tag%3A8148&amp;children=0&amp;enable_m3_private_room=true&amp;infants=0&amp;pets=0&amp;photo_id=1750644422&amp;search_mode=regular_search&amp;check_in=2025-01-18&amp;check_out=2025-01-23&amp;source_impression_id=p3_1729605408_P3X7GT0Ec98R7_ET&amp;previous_page_section_name=1000&amp;federated_search_id=62850efb-a8ab-4062-92ec-e9010fc6a24f\"]  # Replace with actual URLs\r\nscraped_data = []\r\n\r\n# Scrape the details page for each URL with proxy rotation\r\nfor url in url_list:\r\n    print(f\"Scraping details from: {url}\")\r\n    data = scrape_details_page(url)\r\n    if data:\r\n        scraped_data.append(data)\r\n\r\n# After scraping, save data to CSV\r\nif scraped_data:\r\n    save_to_csv(scraped_data)\r\nelse:\r\n    print(\"No data to save.\")\r\n<\/pre>\n<h1 id=\"Legal and Ethical Issues\"><b>Legal and Ethical Issues<\/b><\/h1>\n<p><span style=\"font-weight: 
400;\">To prevent potential problems, you must scrape data from websites legally and ethically. In every case, review the website's Terms of Service before you start.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Avoid overloading a website's servers with many requests in a short period: throttle your scraper and honor timeouts to minimize disruption. Scraping should never degrade a site's performance or extract proprietary or sensitive data.<\/span><\/p>\n<p><a href=\"https:\/\/github.com\/MDFARHYN\/airbnbScraping\" rel=\"nofollow noopener\" target=\"_blank\">Download the full code from GitHub.<\/a><\/p>\n<p><a href=\"https:\/\/www.youtube.com\/watch?v=R2YoMnqJ2rg\" rel=\"nofollow noopener\" target=\"_blank\">Watch the full tutorial on YouTube<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Download the full code from GitHub. Table of content Introduction Prerequisites Understanding Airbnb\u2019s Structure for scraping Get Dynamic Content from Listing Pages Using Regex 
Handling&hellip;<\/p>\n","protected":false},"author":23,"featured_media":1288,"comment_status":"open","ping_status":"closed","template":"","meta":{"rank_math_lock_modified_date":false},"categories":[],"class_list":["post-1285","scraping_project","type-scraping_project","status-publish","has-post-thumbnail","hentry"],"_links":{"self":[{"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/scraping_project\/1285","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/scraping_project"}],"about":[{"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/types\/scraping_project"}],"author":[{"embeddable":true,"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/users\/23"}],"replies":[{"embeddable":true,"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/comments?post=1285"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/media\/1288"}],"wp:attachment":[{"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/media?parent=1285"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/categories?post=1285"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}