Airbnb Web Scraping with Python: Extract Listings and Pricing Data

Download the full code from GitHub.


Introduction

Scraping Airbnb lets businesses and researchers learn about rental trends, consumer preferences, and pricing dynamics. This rich data is valuable for competitive analysis, location-based investment decisions, and understanding seasonal demand shifts. It is only useful, however, if it is collected legally and ethically, in line with Airbnb’s terms of service and privacy policies.

Prerequisites

To start scraping Airbnb data, we’ll use Python with key libraries that make data extraction and management more efficient. Here’s a quick overview of the tools and setup steps:

  1. Python: Ensure Python (preferably 3.10) is installed. You can download it from Python’s official site.
  2. Regex: Python’s built-in re module lets you extract targeted data by matching patterns in the HTML. It is quick to apply to elements such as prices, descriptions, and locations, although patterns tied to specific markup need updating when the page changes. It is part of the Python standard library, so you do not have to install any additional package.
  3. Selenium-stealth: Some websites, including Airbnb, use bot protection that makes it hard to scrape them directly. selenium-stealth reduces the chance of detection by making the automated browser look more like a regular one. Install with:

    pip install selenium-stealth

  4. Pandas: This library helps manage, analyze, and store data in a structured way (like CSV files). Install with:

    pip install pandas

Once you have these libraries installed, you’re ready to begin data extraction. These tools will enable a more flexible and robust scraping workflow. Remember, before collecting data, always confirm that your usage aligns with Airbnb’s terms of service and guidelines.

Understanding Airbnb’s Structure for Scraping

To scrape Airbnb data, you need to know the structure of an Airbnb listing page so that you can identify important components such as property information, prices, and location information. Here is a guide to identifying and extracting this data properly:

Identifying Key Data Elements

Property Details: Main characteristics such as property title, type (apartment/house), number of rooms, and additional attributes. Open Airbnb’s listing pages and inspect their HTML structure with your browser’s Developer Tools (right-click > Inspect, or press F12). Identify the classes or HTML tags that consistently contain these details.

Pricing: Pricing information, such as per-night rates and fees. Prices are normally shown in specific tags (e.g. `<span>` or `<div>`).

Location:  Most listings include an approximate location in the form of city, neighborhood or distance to nearby landmarks instead of a precise address. You can get this information in meta tags or some descriptive field inside the page.

Review and Rating Details: Airbnb listings also provide detailed breakdowns of user ratings across different categories such as cleanliness, communication, check-in, location, accuracy, and value.

Handling Pagination

Pagination Structure: Airbnb search pages generally have pagination controls at the bottom for loading more results. These are usually “Next” links or direct page numbers, which you can use to move through the data.

Automate Page Navigation: You can use Selenium to click the “Next” button on each page. Alternatively, if the page URL has a known pagination parameter (e.g. page=2), you can change that parameter in your code and fetch listings in batches.
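The URL-parameter approach can be sketched with the standard library alone. This is a minimal sketch, not Airbnb’s actual scheme: the example.com URL and the `page` parameter name are placeholders, and `with_query_param` is a hypothetical helper. Check the address bar while paging through real results to find the parameter the site actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_query_param(url: str, name: str, value) -> str:
    """Return `url` with the query parameter `name` set to `value`."""
    parts = urlparse(url)
    # Keep the existing parameters, dropping any previous value of `name`
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != name]
    params.append((name, str(value)))
    return urlunparse(parts._replace(query=urlencode(params)))


# Hypothetical search URL and parameter name, for illustration only
base_url = "https://www.example.com/search?query=United%20States"
page_urls = [with_query_param(base_url, "page", n) for n in range(1, 4)]
for page_url in page_urls:
    print(page_url)
```

Each generated URL can then be passed to driver.get() in turn instead of clicking through the pages.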

Pull Data from Multiple Listings

Iterating Over Listings: We can iterate the listings once we have located the DOM elements representing them on a page.

Storing and Structuring the Data: Store the collected data in a pandas DataFrame, a table-like format that you can analyze or save as CSV.

Keep an Eye Out for Dynamic Content: Airbnb changes its website structure often, so your script has to be updated as new elements or layouts are introduced.

Fetching Data: Using the selenium-stealth Library to Get Dynamic Content from Listing Pages

When scraping Airbnb listings, a lot of the important information—like prices or availability—may not be immediately available in the static HTML. This is because it’s often loaded dynamically via JavaScript after the page fully loads. To deal with this, we can use the selenium-stealth library to automate a browser and fetch the fully loaded content.

Here’s a simple example of fetching and printing the page content:

from selenium import webdriver
from selenium_stealth import stealth
import time
import re

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")

# options.add_argument("--headless")

options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options)


# Stealth setup to avoid detection
stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )


# Navigate to the listing page
url = "https://www.airbnb.com/s/United-States/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_lengths%5B%5D=one_week&monthly_start_date=2024-11-01&monthly_length=3&monthly_end_date=2025-02-01&price_filter_input_type=0&channel=EXPLORE&query=United%20States&place_id=ChIJCzYy5IS16lQRQrfeQ5K5Oxw&date_picker_type=calendar&source=structured_search_input_header&search_type=user_map_move&search_mode=regular_search&price_filter_num_nights=5&ne_lat=78.7534545389953&ne_lng=17.82560738379206&sw_lat=-36.13028852123955&sw_lng=-124.379810004604&zoom=2.613816079556603&zoom_level=2.613816079556603&search_by_map=true"

driver.get(url)

# Give the JavaScript-rendered content a few seconds to load
time.sleep(5)

# Fetch and print the page source
html_content = driver.page_source
print(html_content)
driver.quit()

Code explanation:

  1. Chrome Setup: Configures Chrome with options to avoid detection (e.g., starting maximized, hiding automation flags).
  2. Stealth Mode: Uses selenium_stealth to mimic a human user (adjusts language, platform, renderer).
  3. Navigate & Capture: Opens the specified Airbnb URL, captures the HTML content with driver.page_source.
  4. Close Browser: Ends the session with driver.quit()

 Extracting Key Information from HTML Using Regex

Once we have the HTML content using selenium-stealth, the next step is to pass the HTML to regex for extracting important information such as details page URLs, prices, and other key details. By using regex, we can efficiently target specific patterns within the HTML without needing to parse the entire document structure.

Here’s a simple example of how to extract key information like the details page URL:

import re

# Define a regex pattern to capture all property URLs from listing pages
url_pattern = r'labelledby="[^"]+" href="(/rooms/\d+[^"]+)"'

# Find all matching URLs in the HTML content
urls = re.findall(url_pattern, html_content)
print(len(urls))

 
url_list = [] #Storing all URLs in a Python list

for url in urls:
    details_page_url =  "https://www.airbnb.com"+url
    print(details_page_url) # Print extracted URLs
    url_list.append(details_page_url)

This regex pattern is used to capture Airbnb property URLs from an HTML content string.

Code Explanation

  • urls = re.findall(url_pattern, html_content): Finds all instances of URLs that match url_pattern in html_content. Each match is added to the urls list.
  • The for loop:
    • Iterates through each matched url in urls.
    • Prepends each URL with the base https://www.airbnb.com, forming a complete URL.
    • Prints each URL and appends it to url_list for later use.
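To see the pattern in isolation, it can be run against a small hand-written fragment. This is only a sketch: the HTML below is a made-up sample shaped like the markup the pattern expects, not real Airbnb output. The pattern is written as a raw string with `\d+` for the numeric listing ID. Because the same listing may be linked more than once on a page, the sketch also removes duplicates while keeping the original order.

```python
import re

# Hypothetical fragment shaped like the markup the pattern expects
sample_html = (
    '<a aria-labelledby="title_1" href="/rooms/12345?adults=1">Listing A</a>'
    '<a aria-labelledby="title_1" href="/rooms/12345?adults=1">Listing A photo</a>'
    '<a aria-labelledby="title_2" href="/rooms/67890?adults=1">Listing B</a>'
)

url_pattern = r'labelledby="[^"]+" href="(/rooms/\d+[^"]+)"'

matches = re.findall(url_pattern, sample_html)

# dict.fromkeys keeps the first occurrence of each path and preserves order
unique_urls = ["https://www.airbnb.com" + path for path in dict.fromkeys(matches)]
print(unique_urls)
```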

Handling Pagination: Navigating Through Multiple Pages of Listings Efficiently

To make scraping more flexible, you can allow the user to input how many pages they want to scrape. This ensures the scraper clicks through the exact number of pages requested and stops automatically. Here’s how you can modify the pagination logic to accept user input for the number of pages:

from selenium import webdriver
from selenium_stealth import stealth
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import re
import pandas as pd

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options)

# Stealth setup to avoid detection
stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )

# Function to scrape the current page and return all property URLs
def scrape_current_page():
    html_content = driver.page_source
    url_pattern = r'labelledby="[^"]+" href="(/rooms/\d+[^"]+)"'
    urls = re.findall(url_pattern, html_content)
    return urls

# Function to scroll to the bottom of the page
def scroll_to_bottom():
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # Give time for the page to load additional content

# Function to wait for the "Next" button and click it
def go_to_next_page():
    try:
        # Wait until the "Next" button is clickable
        next_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "a[aria-label='Next']"))
        )
        scroll_to_bottom()  # Scroll to the bottom of the page before clicking
        next_button.click()
        return True
    except Exception as e:
        print(f"Couldn't navigate to next page: {e}")
        return False

# base url
url = "https://www.airbnb.com/s/United-States/homes?flexible_trip_lengths%5B%5D=one_week&date_picker_type=flexible_dates&place_id=ChIJCzYy5IS16lQRQrfeQ5K5Oxw&refinement_paths%5B%5D=%2Fhomes&search_type=AUTOSUGGEST"
driver.get(url)

# Ask the user how many pages to scrape
num_pages = int(input("How many pages do you want to scrape? "))

url_list = []  # Storing all URLs in a Python list

# Scrape the specified number of pages
for page in range(num_pages):
    print(f"Scraping page {page + 1}...")
   
    # Scrape URLs from the current page
    urls = scrape_current_page()
    for url in urls:
        details_page_url = "https://www.airbnb.com" + url
        print(details_page_url)  # Print extracted URLs
        url_list.append(details_page_url)
   
    # Try to go to the next page
    if not go_to_next_page():
        break  # If there's no "Next" button or an error occurs, stop the loop
   
    # Wait for the next page to load
    time.sleep(3)

# After scraping is complete, print the total number of URLs
print(f"Total URLs scraped: {len(url_list)}")

Code explanation:

  1. Page Navigation Loop: The code iterates through multiple pages based on the num_pages input, scraping each page’s URLs.
  2. Scrape Current Page: scrape_current_page() extracts property URLs from the HTML of the current page using regex.
  3. Scroll to Bottom: scroll_to_bottom() scrolls to the bottom of the page, ensuring any lazy-loaded content is loaded.
  4. Next Page Button: go_to_next_page() waits for the “Next” button to appear and scrolls to the bottom before clicking it. If the button is clickable, it moves to the next page; if not, it stops the loop.
  5. Repeat: This process repeats for each page until the specified num_pages is reached or no “Next” button is found.

This pagination approach allows the code to move seamlessly through multiple pages, scraping each page’s data until reaching the end.
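The control flow of this loop can be tried without a browser. The sketch below is an illustration under assumptions, not part of the scraper itself: `collect_pages` is a hypothetical helper, and `fake_scrape` and `fake_next` are stand-ins for scrape_current_page() and go_to_next_page() that walk through three made-up pages.

```python
from typing import Callable, List


def collect_pages(scrape_page: Callable[[], List[str]],
                  go_next: Callable[[], bool],
                  max_pages: int) -> List[str]:
    """Scrape up to `max_pages` pages, stopping early when `go_next` fails."""
    collected: List[str] = []
    for page in range(max_pages):
        collected.extend(scrape_page())
        # Skip navigation after the last requested page
        if page == max_pages - 1 or not go_next():
            break
    return collected


# Stand-ins for scrape_current_page() and go_to_next_page(): three fake pages
fake_pages = [["/rooms/1", "/rooms/2"], ["/rooms/3"], ["/rooms/4"]]
position = {"index": 0}


def fake_scrape() -> List[str]:
    return fake_pages[position["index"]]


def fake_next() -> bool:
    if position["index"] + 1 < len(fake_pages):
        position["index"] += 1
        return True
    return False


print(collect_pages(fake_scrape, fake_next, max_pages=2))
```

One design difference from the loop above: this variant does not click “Next” after the last requested page, which saves one page load per run.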

Handling Dynamic Content: Using Selenium Stealth to Scrape Detail Pages Loaded with JavaScript

After gathering the list of detail page URLs, the next step is to loop through each URL and extract the required information. Since these detail pages load dynamic content via JavaScript, Selenium Stealth ensures the page is fully loaded before we extract the data using regex to find and pull specific information from the HTML.

Here’s how you can handle this:

  1. Loop Through URLs: Iterate through each URL from your list.
  2. Load the Page: Use Selenium Stealth to fully load each detail page.
  3. Extract Data with Regex: Use regex to extract specific data such as pricing, reviews, or descriptions.
# function to scrape information from a details page (title, price, etc.)
def scrape_details_page(url):
    try:
        driver.get(url)
        # Wait for the page to load (you can adjust this)
        time.sleep(2)
        # Scroll first so lazy-loaded sections are present in the page source
        scroll_to_bottom()
        time.sleep(2)
        html_content = driver.page_source
        # Regex pattern for scraping the title
        title_pattern = r'<h1[^>]+>([^<]+)</h1>'
   
        # Scrape the title (adjust the selector according to the page structure)
        title = re.search(title_pattern,html_content)
        if title:
           title = title.group(1)
        else:
            title = None
       
        price_pattern = r'(\$\d+[^<]+)</span></span>[^>]+></div></div>'
        price = re.search(price_pattern,html_content)
   
        if price:
            price = price.group(1)
        else:
            price = None

        address_pattern = r'dir-ltr"><div[^>]+><section><div[^>]+ltr"><h2[^>]+>([^<]+)</h2>'
        address =  re.search(address_pattern,html_content)
        if address:
           address =  address.group(1)
        else:
            address = None
       
        guest_pattern = r'<li class="l7n4lsf[^>]+>([^<]+)<span'
        guest =   re.search(guest_pattern,html_content)
        if guest:
           guest = guest.group(1)
        else:
            guest = None
        # You can add more information to scrape (example: price, description, etc.)
       
        bed_bath_pattern = r'</span>(\d+[^<]+)'
        bed_bath = re.findall(bed_bath_pattern,html_content)
        bed_bath_details = []
        if bed_bath:
            for bed_bath_info in bed_bath:
                bed_bath_details.append(bed_bath_info.strip())
       
        reviews_pattern = r'l1nqfsv9[^>]+>([^<]+)</div>[^>]+>(\d+[^<]+)</div>'
        reviews_details =  re.findall(reviews_pattern,html_content)
        review_list = []
        if reviews_details:
               for review in reviews_details:
                    attribute, rating = review  # Unpack the attribute and rating
                    review_list.append(f'{attribute} {rating}')  # Combine into a readable format


        host_name_pattern = r't1gpcl1t atm_w4_16rzvi6 atm_9s_1o8liyq atm_gi_idpfg4 dir dir-ltr[^>]+>([^<]+)'
        host_name =  re.search(host_name_pattern,html_content)
        if host_name:
           host_name = host_name.group(1)    
        else:
            host_name = None

        total_review_pattern = r'pdp-reviews-[^>]+>[^>]+>(\d+[^<]+)</span>'
        total_review =  re.search(total_review_pattern,html_content)
        if total_review:
           total_review =  total_review.group(1)    
        else:
            total_review = None


        host_info_pattern = r'd1u64sg5[^"]+atm_67_1vlbu9m dir dir-ltr[^>]+><div><span[^>]+>([^<]+)'
        host_info = re.findall(host_info_pattern,html_content)
        host_info_list = []
        if host_info:
            for host_info_details in host_info:
                 host_info_list.append(host_info_details)
       
        # Print the scraped information (for debugging purposes)
        print(f"Title: {title}\n Price:{price}\n Address: {address}\n Guest: {guest}\n bed_bath_details:{bed_bath_details}\n Reviews: {review_list}\n Host_name: {host_name}\n total_review: {total_review}\n Host Info: {host_info_list}\n ")
       
        # Return the information as a dictionary (or adjust based on your needs)
          # Store the scraped information in a dictionary
        return {
            "url": url,
            "Title": title,
            "Price": price,
            "Address": address,
            "Guest": guest,
            "Bed_Bath_Details": bed_bath_details,
            "Reviews": review_list,
            "Host_Name": host_name,
            "Total_Reviews": total_review,
            "Host_Info": host_info
        }
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None


# Scrape the details page for each URL stored in the url_list  
for url in url_list:
    print(f"Scraping details from: {url}")
    data = scrape_details_page(url)

In this approach, we load each detail page using Selenium Stealth to ensure dynamic JavaScript content is fully loaded, and then use regex to extract specific data directly from the HTML content. This method is efficient and flexible for scraping pages with dynamically loaded content.
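The search-then-check pattern repeated for each field can be folded into one small helper. This is a sketch under assumptions: `first_match` is a hypothetical helper that is not part of the code above, and the HTML fragment is made up for illustration. Real Airbnb markup differs and changes often, so the simplified price pattern here is not the one used against live pages.

```python
import re
from typing import Optional


def first_match(pattern: str, html: str) -> Optional[str]:
    """Return the first capture group of `pattern` in `html`, or None."""
    match = re.search(pattern, html)
    return match.group(1).strip() if match else None


# Hypothetical fragment; real Airbnb markup differs and changes often
sample_html = '<h1 class="title">Cozy loft near downtown</h1><span>$120 per night</span>'

title = first_match(r'<h1[^>]+>([^<]+)</h1>', sample_html)
price = first_match(r'(\$\d+[^<]*)</span>', sample_html)
missing = first_match(r'<h2[^>]+>([^<]+)</h2>', sample_html)
print(title, price, missing)
```

A field whose pattern no longer matches comes back as None instead of raising an error, which makes it easier to spot when a page layout has changed.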

Saving the Data: Storing Scraped Data Using Pandas in CSV/Excel Formats

After scraping and extracting data from the detail pages, it is time to store that data in a usable form. Pandas makes it easy to save the extracted information in popular formats like CSV or Excel, both widely used for data analysis and sharing.

This is how you can do it:

  Step 1: Create a DataFrame: After fetching the data, place it in a pandas DataFrame.

  Step 2: Save to CSV/Excel: Save the DataFrame as a CSV or Excel file with pandas functions.

Example:

import pandas as pd

# Function to save data to CSV using pandas
def save_to_csv(data, filename='airbnb_data.csv'):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")




scraped_data = []


# Scrape the details page for each URL stored in the url_list  
for url in url_list:
    print(f"Scraping details from: {url}")
    data = scrape_details_page(url)
    if data:
        scraped_data.append(data)
     

# After scraping, save data to CSV
if scraped_data:
    save_to_csv(scraped_data)
else:
    print("No data to save.")

By using Pandas, you can easily manage and store the scraped data in structured formats, making it simple to use for further analysis or sharing with others. Here is a screenshot of the CSV result:

[Screenshot: resulting CSV file]
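Some of the scraped fields (bed and bath details, reviews, host info) are Python lists, and pandas writes such cells to CSV as their string representation, brackets and quotes included. If plain text cells are preferred, the lists can be joined before the DataFrame is built. The sketch below assumes a hypothetical helper, `flatten_record`, and a made-up record shaped like the dictionaries returned by scrape_details_page().

```python
from typing import Any, Dict


def flatten_record(record: Dict[str, Any], separator: str = "; ") -> Dict[str, Any]:
    """Return a copy of `record` with list values joined into one string."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, (list, tuple)):
            flat[key] = separator.join(str(item) for item in value)
        else:
            flat[key] = value
    return flat


# Hypothetical record shaped like the dictionaries returned by scrape_details_page()
record = {
    "Title": "Cozy loft near downtown",
    "Bed_Bath_Details": ["2 bedrooms", "3 beds", "1 bath"],
    "Reviews": ["Cleanliness 4.9", "Accuracy 4.8"],
    "Price": None,
}
print(flatten_record(record))
```

With this in place, the save step could be called as save_to_csv([flatten_record(r) for r in scraped_data]).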

Here is the full code:

from selenium import webdriver
from selenium_stealth import stealth
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import re
import pandas as pd

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options)

# Stealth setup to avoid detection
stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )

# Function to scrape the current page and return all property URLs
def scrape_current_page():
    html_content = driver.page_source
    url_pattern = r'labelledby="[^"]+" href="(/rooms/\d+[^"]+)"'
    urls = re.findall(url_pattern, html_content)
    return urls

# Function to scroll to the bottom of the page
def scroll_to_bottom():
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # Give time for the page to load additional content

# Function to wait for the "Next" button and click it
def go_to_next_page():
    try:
        # Wait until the "Next" button is clickable
        next_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "a[aria-label='Next']"))
        )
        scroll_to_bottom()  # Scroll to the bottom of the page before clicking
        next_button.click()
        return True
    except Exception as e:
        print(f"Couldn't navigate to next page: {e}")
        return False

# base url
url = "https://www.airbnb.com/s/United-States/homes?flexible_trip_lengths%5B%5D=one_week&date_picker_type=flexible_dates&place_id=ChIJCzYy5IS16lQRQrfeQ5K5Oxw&refinement_paths%5B%5D=%2Fhomes&search_type=AUTOSUGGEST"
driver.get(url)

# Ask the user how many pages to scrape
num_pages = int(input("How many pages do you want to scrape? "))

url_list = []  # Storing all URLs in a Python list

# Scrape the specified number of pages
for page in range(num_pages):
    print(f"Scraping page {page + 1}...")
    
    # Scrape URLs from the current page
    urls = scrape_current_page()
    for url in urls:
        details_page_url = "https://www.airbnb.com" + url
        print(details_page_url)  # Print extracted URLs
        url_list.append(details_page_url)
    
    # Try to go to the next page
    if not go_to_next_page():
        break  # If there's no "Next" button or an error occurs, stop the loop
    
    # Wait for the next page to load
    time.sleep(3)

# After scraping is complete, print the total number of URLs
print(f"Total URLs scraped: {len(url_list)}")





# function to scrape information from a details page (title, price, etc.)
def scrape_details_page(url):
    try:
        driver.get(url)
        # Wait for the page to load (you can adjust this)
        time.sleep(2)
        # Scroll first so lazy-loaded sections are present in the page source
        scroll_to_bottom()
        time.sleep(2)
        html_content = driver.page_source
        # Regex pattern for scraping the title
        title_pattern = r'<h1[^>]+>([^<]+)</h1>'
    
        # Scrape the title (adjust the selector according to the page structure)
        title = re.search(title_pattern,html_content)
        if title:
           title = title.group(1)
        else:
            title = None
        
        price_pattern = r'(\$\d+[^<]+)</span></span>[^>]+></div></div>'
        price = re.search(price_pattern,html_content)
    
        if price:
            price = price.group(1)
        else:
            price = None

        address_pattern = r'dir-ltr"><div[^>]+><section><div[^>]+ltr"><h2[^>]+>([^<]+)</h2>'
        address =  re.search(address_pattern,html_content)
        if address:
           address =  address.group(1)
        else:
            address = None
        
        guest_pattern = r'<li class="l7n4lsf[^>]+>([^<]+)<span'
        guest =   re.search(guest_pattern,html_content)
        if guest:
           guest = guest.group(1)
        else:
            guest = None
        # You can add more information to scrape (example: price, description, etc.)
        
        bed_bath_pattern = r'</span>(\d+[^<]+)'
        bed_bath = re.findall(bed_bath_pattern,html_content)
        bed_bath_details = [] 
        if bed_bath:
            for bed_bath_info in bed_bath:
                bed_bath_details.append(bed_bath_info.strip())
        
        reviews_pattern = r'l1nqfsv9[^>]+>([^<]+)</div>[^>]+>(\d+[^<]+)</div>'
        reviews_details =  re.findall(reviews_pattern,html_content)
        review_list = []
        if reviews_details:
               for review in reviews_details:
                    attribute, rating = review  # Unpack the attribute and rating
                    review_list.append(f'{attribute} {rating}')  # Combine into a readable format


        host_name_pattern = r't1gpcl1t atm_w4_16rzvi6 atm_9s_1o8liyq atm_gi_idpfg4 dir dir-ltr[^>]+>([^<]+)'
        host_name =  re.search(host_name_pattern,html_content)
        if host_name:
           host_name = host_name.group(1)    
        else:
            host_name = None

        total_review_pattern = r'pdp-reviews-[^>]+>[^>]+>(\d+[^<]+)</span>'
        total_review =  re.search(total_review_pattern,html_content)
        if total_review:
           total_review =  total_review.group(1)    
        else:
            total_review = None


        host_info_pattern = r'd1u64sg5[^"]+atm_67_1vlbu9m dir dir-ltr[^>]+><div><span[^>]+>([^<]+)'
        host_info = re.findall(host_info_pattern,html_content)
        host_info_list = []
        if host_info:
            for host_info_details in host_info:
                 host_info_list.append(host_info_details)
        
        # Print the scraped information (for debugging purposes)
        print(f"Title: {title}\n Price:{price}\n Address: {address}\n Guest: {guest}\n bed_bath_details:{bed_bath_details}\n Reviews: {review_list}\n Host_name: {host_name}\n total_review: {total_review}\n Host Info: {host_info_list}\n ")
        
        # Return the information as a dictionary (or adjust based on your needs)
          # Store the scraped information in a dictionary
        return {
            "url": url,
            "Title": title,
            "Price": price,
            "Address": address,
            "Guest": guest,
            "Bed_Bath_Details": bed_bath_details,
            "Reviews": review_list,
            "Host_Name": host_name,
            "Total_Reviews": total_review,
            "Host_Info": host_info
        }
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None


# Function to save data to CSV using pandas
def save_to_csv(data, filename='airbnb_data.csv'):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")



scraped_data = []


# Scrape the details page for each URL stored in the url_list  
for url in url_list:
    print(f"Scraping details from: {url}")
    data = scrape_details_page(url)
    if data:
        scraped_data.append(data)
     

# After scraping, save data to CSV
if scraped_data:
    save_to_csv(scraped_data)
else:
    print("No data to save.")

# Close the browser
driver.quit()
    





Avoiding Blocks: Techniques to Prevent Getting Blocked

Sending a high volume of traffic from a single IP address to a site such as Airbnb can get you blocked or flagged as a bot. Proxies help prevent this because they hide your true IP address, so the requests appear to come from different places.

Why Use Proxies?

Without a proxy, you can get rate limited or completely blocked for sending too many requests from the same IP address. To avoid this, you can use a proxy that hides your IP.

Why Use Rotating Proxies?

Preventing IP Bans: Rotating proxies change the IP address with each request (or every few requests), which lowers the chance of your scraper being detected and banned.

Avoiding Captchas: When a website sees a lot of traffic coming from one IP address, it may start serving captchas. Rotating proxies spread requests across numerous IPs, which helps to avoid them.

I use rotating proxies from Rayobyte in this tutorial’s example, but you can use any other reliable proxy provider that supports rotation. Rotating proxies reduce the likelihood of captchas and IP bans, allowing you to scrape faster for longer periods.
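The rotation idea itself is independent of Selenium and can be shown in a few lines. The Selenium example in this section picks a proxy at random for each request; the sketch below shows a round-robin alternative that spreads requests evenly across the pool. Everything in it is a placeholder: the proxy hosts, credentials, and listing URLs are made up, and `next_proxy` is a hypothetical helper.

```python
from itertools import cycle

# Placeholder proxies; replace with the endpoints from your provider
proxy_pool = [
    {"proxy": "proxy1.example.com:8000", "username": "user1", "password": "pass1"},
    {"proxy": "proxy2.example.com:8000", "username": "user2", "password": "pass2"},
    {"proxy": "proxy3.example.com:8000", "username": "user3", "password": "pass3"},
]

# cycle() yields the pool entries in order and starts over at the end
proxy_cycle = cycle(proxy_pool)


def next_proxy() -> dict:
    """Return the next proxy in round-robin order."""
    return next(proxy_cycle)


# Hypothetical listing URLs, for illustration only
urls = [f"https://www.airbnb.com/rooms/{n}" for n in range(1, 6)]
assignments = [(url, next_proxy()["proxy"]) for url in urls]
for url, proxy in assignments:
    print(f"{url} -> {proxy}")
```

Random selection can hit the same proxy several times in a row; round-robin avoids that, while random choice makes the request pattern less predictable.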

Example of using proxy:

import pandas as pd
import re
import time
import random
from selenium import webdriver
from selenium_stealth import stealth
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


# Function to create proxy authentication extension
def create_proxy_auth_extension(proxy_host, proxy_user, proxy_pass):
    import zipfile
    import os

    # Separate the host and port
    host = proxy_host.split(':')[0]
    port = proxy_host.split(':')[1]

    # Define proxy extension files
    manifest_json = """
    {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "Chrome Proxy",
        "permissions": [
            "proxy",
            "tabs",
            "unlimitedStorage",
            "storage",
            "<all_urls>",
            "webRequest",
            "webRequestBlocking"
        ],
        "background": {
            "scripts": ["background.js"]
        },
        "minimum_chrome_version":"22.0.0"
    }
    """
    
    background_js = f"""
    var config = {{
            mode: "fixed_servers",
            rules: {{
              singleProxy: {{
                scheme: "http",
                host: "{host}",
                port: parseInt({port})
              }},
              bypassList: ["localhost"]
            }}
          }};
    chrome.proxy.settings.set({{value: config, scope: "regular"}}, function() {{}});

    chrome.webRequest.onAuthRequired.addListener(
        function(details) {{
            return {{
                authCredentials: {{
                    username: "{proxy_user}",
                    password: "{proxy_pass}"
                }}
            }};
        }},
        {{urls: ["<all_urls>"]}},
        ["blocking"]
    );
    """

    # Create the extension
    pluginfile = 'proxy_auth_plugin.zip'
    with zipfile.ZipFile(pluginfile, 'w') as zp:
        zp.writestr("manifest.json", manifest_json)
        zp.writestr("background.js", background_js)

    return pluginfile


# Function to configure and return the WebDriver with proxy
def init_driver_with_proxy(proxy_server, proxy_username, proxy_password):
    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")

    # Add proxy authentication if necessary
    if proxy_username and proxy_password:
        options.add_extension(create_proxy_auth_extension(proxy_server, proxy_username, proxy_password))

    # Stealth mode to avoid detection
    driver = webdriver.Chrome(options=options)
    stealth(driver,
            languages=["en-US", "en"],
            vendor="Google Inc.",
            platform="Win32",
            webgl_vendor="Intel Inc.",
            renderer="Intel Iris OpenGL Engine",
            fix_hairline=True,
            )
    return driver


# Proxy pool for rotation (list of proxy servers)
proxy_pool = [
    {"proxy": "proxy1.com:8000", "username": "user1", "password": "pass1"},
    {"proxy": "proxy2.com:8000", "username": "user2", "password": "pass2"},
    {"proxy": "proxy3.com:8000", "username": "user3", "password": "pass3"}
   
]

# Function to scrape details page (rotate proxy on each request)
def scrape_details_page(url):
    try:
        # Rotate proxy by choosing a random one from the pool
        proxy = random.choice(proxy_pool)
        driver = init_driver_with_proxy(proxy['proxy'], proxy['username'], proxy['password'])

        driver.get(url)
        time.sleep(3)  # Wait for the page to load

        html_content = driver.page_source

        # Regex pattern for scraping the title
        title_pattern = r'<h1[^>]+>([^<]+)</h1>'
    
        # Scrape the title  
        title = re.search(title_pattern,html_content)
        if title:
           title = title.group(1)
        else:
            title = None

        # Scrape the price  
        price_pattern = r'(\$\d+[^<]+)</span></span>[^>]+></div></div>'
        price = re.search(price_pattern,html_content)
    
        if price:
            price = price.group(1)
        else:
            price = None
        
        # Scrape the address  
        address_pattern = r'dir-ltr"><div[^>]+><section><div[^>]+ltr"><h2[^>]+>([^<]+)</h2>'
        address =  re.search(address_pattern,html_content)
        if address:
           address =  address.group(1)
        else:
            address = None

        # Scrape the guest  
        guest_pattern = r'<li class="l7n4lsf[^>]+>([^<]+)<span'
        guest =   re.search(guest_pattern,html_content)
        if guest:
           guest = guest.group(1)
        else:
            guest = None
        # Additional fields can be scraped below with the same search-and-check approach

        # Scrape the bedroom, bed, and bath details (\d+ matches the leading count)
        bed_bath_pattern = r'</span>(\d+[^<]+)'
        bed_bath = re.findall(bed_bath_pattern, html_content)
        bed_bath_details = [bed_bath_info.strip() for bed_bath_info in bed_bath]
       
        # Scrape review categories such as Cleanliness, Accuracy, Communication, etc.
        reviews_pattern = r'l1nqfsv9[^>]+>([^<]+)</div>[^>]+>(\d+[^<]+)</div>'
        reviews_details = re.findall(reviews_pattern, html_content)
        review_list = []
        for attribute, rating in reviews_details:  # Each match is an (attribute, rating) pair
            review_list.append(f'{attribute} {rating}')  # Combine into a readable format

        # Scrape the host name
        host_name_pattern = r't1gpcl1t atm_w4_16rzvi6 atm_9s_1o8liyq atm_gi_idpfg4 dir dir-ltr[^>]+>([^<]+)'
        host_name = re.search(host_name_pattern, html_content)
        if host_name:
            host_name = host_name.group(1)
        else:
            host_name = None

        # Scrape the total number of reviews
        total_review_pattern = r'pdp-reviews-[^>]+>[^>]+>(\d+[^<]+)</span>'
        total_review = re.search(total_review_pattern, html_content)
        if total_review:
            total_review = total_review.group(1)
        else:
            total_review = None

        # Scrape host info
        host_info_pattern = r'd1u64sg5[^"]+atm_67_1vlbu9m dir dir-ltr[^>]+><div><span[^>]+>([^<]+)'
        host_info = re.findall(host_info_pattern, html_content)
        host_info_list = list(host_info)
        
        # Print the scraped information (for debugging purposes)
        print(f"Title: {title}\n Price: {price}\n Address: {address}\n Guest: {guest}\n Bed_Bath_Details: {bed_bath_details}\n Reviews: {review_list}\n Host_Name: {host_name}\n Total_Reviews: {total_review}\n Host Info: {host_info_list}\n")
        
        # Return the scraped information as a dictionary (adjust the keys to your needs)
        return {
            "url": url,
            "Title": title,
            "Price": price,
            "Address": address,
            "Guest": guest,
            "Bed_Bath_Details": bed_bath_details,
            "Reviews": review_list,
            "Host_Name": host_name,
            "Total_Reviews": total_review,
            "Host_Info": host_info_list
        }
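    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None
    finally:
        # Close the browser so each URL does not leave a Chrome process running
        if "driver" in locals():
            driver.quit()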


# Function to save data to CSV using pandas
def save_to_csv(data, filename='airbnb_data.csv'):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")


# List of URLs to scrape
url_list = ["https://www.airbnb.com/rooms/968367851365040114?adults=1&category_tag=Tag%3A8148&children=0&enable_m3_private_room=true&infants=0&pets=0&photo_id=1750644422&search_mode=regular_search&check_in=2025-01-18&check_out=2025-01-23&source_impression_id=p3_1729605408_P3X7GT0Ec98R7_ET&previous_page_section_name=1000&federated_search_id=62850efb-a8ab-4062-92ec-e9010fc6a24f"]  # Replace with actual URLs
scraped_data = []

# Scrape the details page for each URL with proxy rotation
for url in url_list:
    print(f"Scraping details from: {url}")
    data = scrape_details_page(url)
    if data:
        scraped_data.append(data)

# After scraping, save data to CSV
if scraped_data:
    save_to_csv(scraped_data)
else:
    print("No data to save.")
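
The search-then-check blocks in scrape_details_page all follow the same shape, so they could be collapsed into one small helper. The following is only a minimal sketch: first_match is a hypothetical helper name that is not part of the original script, and sample_html is a made-up fragment for illustration, not real Airbnb markup (which changes often).

```python
import re


def first_match(pattern, html, default=None):
    """Return the first capture group of `pattern` in `html` (stripped), or `default`."""
    match = re.search(pattern, html)
    return match.group(1).strip() if match else default


# Made-up HTML fragment for illustration only; real Airbnb markup differs.
sample_html = (
    '<h1 class="x">Cozy loft in Lisbon</h1>'
    '<div><div><span><span>$120 per night</span></span>'
    '<div class="p"></div></div></div>'
)

title = first_match(r'<h1[^>]+>([^<]+)</h1>', sample_html)
price = first_match(r'(\$\d+[^<]+)</span></span>[^>]+></div></div>', sample_html)
missing = first_match(r'<h2[^>]+>([^<]+)</h2>', sample_html)  # no <h2> in the sample

print(title, price, missing)
```

Testing each pattern against a saved copy of the page source in this way, before running the full browser session, makes it easier to notice when a class name or tag structure has changed.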

Legal and Ethical Issues

To avoid legal problems, scrape data lawfully and ethically. Always review a website's Terms of Service before collecting anything from it.

Avoid overloading a website's servers with many requests in a short period. Keep request volume moderate, pause between requests, and back off when the site times out or returns errors. Scraping should never degrade a website's performance or collect proprietary or sensitive data.
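
One way to keep request volume moderate is to pause for a randomized interval between pages and to wait longer after each failed attempt. The sketch below is an illustration under assumptions, not part of the original script: the names polite_delay, backoff_delay, and scrape_politely are hypothetical, and the default timings are arbitrary starting points rather than values recommended by Airbnb.

```python
import random
import time


def polite_delay(base=3.0, jitter=2.0):
    """Return a randomized wait in seconds, between base and base + jitter."""
    return base + random.uniform(0, jitter)


def backoff_delay(attempt, base=5.0, cap=120.0):
    """Return an exponentially growing wait for retry number `attempt` (0-based), capped at `cap`."""
    return min(cap, base * (2 ** attempt))


def scrape_politely(urls, scrape, sleep=time.sleep, max_retries=2):
    """Call scrape(url) for each URL, pausing between pages and backing off on failures.

    `scrape` is assumed to return None on failure, as scrape_details_page does.
    `sleep` is injectable so the pacing logic can be checked without real waiting.
    """
    results = []
    for url in urls:
        for attempt in range(max_retries + 1):
            data = scrape(url)
            if data is not None:
                results.append(data)
                break
            if attempt < max_retries:
                sleep(backoff_delay(attempt))  # wait longer after each failure
        sleep(polite_delay())  # pause before moving on to the next URL
    return results
```

With the script above, a call such as scrape_politely(url_list, scrape_details_page) could stand in for the plain for loop over url_list.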

Download the full code from GitHub.

Watch the full tutorial on YouTube.
