Airbnb Web Scraping with Python: Extract Listings and Pricing Data
Download the full code from GitHub.
Table of Contents
- Introduction
- Prerequisites
- Understanding Airbnb’s Structure for Scraping
- Get Dynamic Content from Listing Pages
- Using Regex
- Handling Pagination
- Scrape Detail Page
- Saving Data to CSV
- Techniques to Prevent Getting Blocked
- Legal and Ethical Issues
Introduction
By scraping Airbnb, businesses and researchers can learn more about rental trends, consumer preferences, and pricing dynamics. This rich data is valuable for competitive analysis, location-based investment decisions, and understanding seasonal demand shifts. But it should only be collected legally, in line with Airbnb’s terms of service and privacy policies, and gathered ethically and lawfully.
Prerequisites
To start scraping Airbnb data, we’ll use Python with key libraries that make data extraction and management more efficient. Here’s a quick overview of the tools and setup steps:
- Python: Ensure Python (preferably 3.10) is installed. You can download it from Python’s official site.
- Regex: Python’s built-in `re` module lets you extract targeted data by recognizing patterns within the HTML. It is fast, handles inconsistent structures well, and makes it easy to pick out elements such as prices, descriptions, and locations. Since it is part of the Python standard library, no additional package needs to be installed.
- Selenium-stealth: Some websites, like Airbnb, include bot protections that prevent you from scraping them directly. selenium-stealth helps bypass these detections by mimicking human-like browsing behavior. Install with:
pip install selenium-stealth
- Pandas: This library helps manage, analyze, and store data in a structured way (like CSV files). Install with:
pip install pandas
Once you have these libraries installed, you’re ready to begin data extraction. These tools will enable a more flexible and robust scraping workflow. Remember, before collecting data, always confirm that your usage aligns with Airbnb’s terms of service and guidelines.
Understanding Airbnb’s Structure for Scraping
To scrape Airbnb data, you need to understand the structure of an Airbnb listing page so that you can identify important components such as property information, prices, and location coordinates. Here is a guide to identifying and extracting this data properly:
Identifying Key Data Elements
Property Details: Main characteristics such as the property title, type (apartment/house), number of rooms, and additional attributes. Open Airbnb’s listing pages and inspect their HTML structure with your browser’s Developer Tools (right-click > Inspect, or press F12). Identify the classes or HTML tags that consistently contain these details.
Pricing: Pricing information, such as per-night rates and fees. Prices normally appear in specific tags (e.g. `<span>` or `<div>`); see the small regex sketch after this list.
Location: Most listings include an approximate location in the form of city, neighborhood, or distance to nearby landmarks rather than a precise address. You can usually find this information in meta tags or descriptive fields inside the page.
Review and Rating Details: Airbnb listings also provide detailed breakdowns of user ratings across different categories such as cleanliness, communication, check-in, location, accuracy, and value.
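To illustrate the idea, here is a minimal sketch of matching a price inside a `<span>` with regex. The `_price` class name and the sample HTML are made up for demonstration; inspect the live page in DevTools and substitute whatever markup Airbnb currently uses.

import re

# Hypothetical markup: the "_price" class is a placeholder, not Airbnb's real class name.
sample_html = '<span class="_price">$120 per night</span>'

match = re.search(r'<span class="_price">\$(\d+)[^<]*</span>', sample_html)
if match:
    nightly_price = int(match.group(1))
    print(nightly_price)  # 120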
Handling Pagination
Pagination structure: Airbnb search pages generally include pagination controls at the bottom to load more items. These pages are typically paginated with “Next” links or direct page numbers, which you can use to step through the data.
Automate Page Navigation: You can automate clicking the “Next” button for each page using Selenium. Alternatively, if the page URL has known pagination parameters (e.g. page=2), you can simply adjust that parameter in your code and fetch listings in batches (see the sketch below).
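For the URL-parameter approach, a minimal sketch might look like the following. The `items_offset` parameter name and the step size of 18 are assumptions about how Airbnb search URLs are often structured, so verify the actual parameters in your own browser before relying on them.

from selenium import webdriver

driver = webdriver.Chrome()
base_url = "https://www.airbnb.com/s/United-States/homes"

for page in range(3):
    # items_offset is an assumed pagination parameter; confirm it in your browser's address bar.
    paged_url = f"{base_url}?items_offset={page * 18}"
    driver.get(paged_url)
    # ... parse driver.page_source for listing data here ...

driver.quit()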
Pull Data from Multiple Listings
Iterating Over Listings: Once we have located the DOM elements representing listings on a page, we can iterate over them to extract each one’s details (a small sketch follows this list).
Storing and Structuring the Data: Store the collected data in a pandas DataFrame, which gives you a table-like format you can analyze or save as CSV.
Keep an Eye Out for Dynamic Content: Airbnb changes its website structure often, so your script will need updating when new elements or layouts appear.
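Here is a minimal sketch of iterating over listing cards and collecting them into a DataFrame. The `data-testid="card-container"` selector is an assumption about Airbnb’s current markup; confirm the selector in DevTools before relying on it.

from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

driver = webdriver.Chrome()
driver.get("https://www.airbnb.com/s/United-States/homes")

# Assumed selector for listing cards; Airbnb's markup changes often.
cards = driver.find_elements(By.CSS_SELECTOR, 'div[data-testid="card-container"]')

# Keep the raw card text for now; refine into title/price/etc. as needed.
rows = [{"card_text": card.text} for card in cards]
df = pd.DataFrame(rows)
print(df.head())

driver.quit()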
Fetching Data: Using the selenium-stealth Library to Get Dynamic Content from Listing Pages
When scraping Airbnb listings, a lot of the important information—like prices or availability—may not be immediately available in the static HTML. This is because it’s often loaded dynamically via JavaScript after the page fully loads. To deal with this, we can use the selenium-stealth library to automate a browser and fetch the fully loaded content.
Here’s a simple example of fetching and printing the page content:
from selenium import webdriver
from selenium_stealth import stealth
import time
import re

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

driver = webdriver.Chrome(options=options)

# Stealth setup to avoid detection
stealth(driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

# Navigate to the listing page
url = "https://www.airbnb.com/s/United-States/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_lengths%5B%5D=one_week&monthly_start_date=2024-11-01&monthly_length=3&monthly_end_date=2025-02-01&price_filter_input_type=0&channel=EXPLORE&query=United%20States&place_id=ChIJCzYy5IS16lQRQrfeQ5K5Oxw&date_picker_type=calendar&source=structured_search_input_header&search_type=user_map_move&search_mode=regular_search&price_filter_num_nights=5&ne_lat=78.7534545389953&ne_lng=17.82560738379206&sw_lat=-36.13028852123955&sw_lng=-124.379810004604&zoom=2.613816079556603&zoom_level=2.613816079556603&search_by_map=true"
driver.get(url)

# Fetch and print the page source
html_content = driver.page_source
print(html_content)

driver.quit()
Code explanation:
- Chrome Setup: Configures Chrome with options to avoid detection (e.g., starting maximized, hiding automation flags).
- Stealth Mode: Uses `selenium_stealth` to mimic a human user (adjusts language, platform, renderer).
- Navigate & Capture: Opens the specified Airbnb URL and captures the HTML content with `driver.page_source`.
- Close Browser: Ends the session with `driver.quit()`.
Extracting Key Information from HTML Using Regex
Once we have the HTML content from selenium-stealth, the next step is to apply regex to extract important information such as detail page URLs, prices, and other key details. By using regex, we can efficiently target specific patterns within the HTML without needing to parse the entire document structure.
Here’s a simple example of how to extract key information like the details page URL:
import re

# Define a regex pattern to capture all property URLs from listing pages
url_pattern = r'labelledby="[^"]+" href="(/rooms/\d+[^"]+)"'

# Find all matching URLs in the HTML content
urls = re.findall(url_pattern, html_content)
print(len(urls))

url_list = []  # Storing all URLs in a Python list
for url in urls:
    details_page_url = "https://www.airbnb.com" + url
    print(details_page_url)  # Print extracted URLs
    url_list.append(details_page_url)
This regex pattern is used to capture Airbnb property URLs from an HTML content string.
Code Explanation
- `urls = re.findall(url_pattern, html_content)`: Finds all instances of URLs that match `url_pattern` in `html_content`. Each match is added to the `urls` list.
- The `for` loop:
  - Iterates through each matched `url` in `urls`.
  - Prepends each URL with the base `https://www.airbnb.com`, forming a complete URL.
  - Prints each URL and appends it to `url_list` for later use.
Handling Pagination: Navigating Through Multiple Pages of Listings Efficiently
To make scraping more flexible, you can allow the user to input how many pages they want to scrape. This ensures the scraper clicks through the exact number of pages requested and stops automatically. Here’s how you can modify the pagination logic to accept user input for the number of pages:
from selenium import webdriver
from selenium_stealth import stealth
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import re
import pandas as pd

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

driver = webdriver.Chrome(options=options)

# Stealth setup to avoid detection
stealth(driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

# Function to scrape the current page and return all property URLs
def scrape_current_page():
    html_content = driver.page_source
    url_pattern = r'labelledby="[^"]+" href="(/rooms/\d+[^"]+)"'
    urls = re.findall(url_pattern, html_content)
    return urls

# Function to scroll to the bottom of the page
def scroll_to_bottom():
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # Give time for the page to load additional content

# Function to wait for the "Next" button and click it
def go_to_next_page():
    try:
        # Wait until the "Next" button is clickable
        next_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "a[aria-label='Next']"))
        )
        scroll_to_bottom()  # Scroll to the bottom of the page before clicking
        next_button.click()
        return True
    except Exception as e:
        print(f"Couldn't navigate to next page: {e}")
        return False

# Base URL
url = "https://www.airbnb.com/s/United-States/homes?flexible_trip_lengths%5B%5D=one_week&date_picker_type=flexible_dates&place_id=ChIJCzYy5IS16lQRQrfeQ5K5Oxw&refinement_paths%5B%5D=%2Fhomes&search_type=AUTOSUGGEST"
driver.get(url)

# Ask the user how many pages to scrape
num_pages = int(input("How many pages do you want to scrape? "))

url_list = []  # Storing all URLs in a Python list

# Scrape the specified number of pages
for page in range(num_pages):
    print(f"Scraping page {page + 1}...")

    # Scrape URLs from the current page
    urls = scrape_current_page()
    for url in urls:
        details_page_url = "https://www.airbnb.com" + url
        print(details_page_url)  # Print extracted URLs
        url_list.append(details_page_url)

    # Try to go to the next page
    if not go_to_next_page():
        break  # If there's no "Next" button or an error occurs, stop the loop

    # Wait for the next page to load
    time.sleep(3)

# After scraping is complete, print the total number of URLs
print(f"Total URLs scraped: {len(url_list)}")
Code explanation:
- Page Navigation Loop: The code iterates through multiple pages based on the `num_pages` input, scraping each page’s URLs.
- Scrape Current Page: `scrape_current_page()` extracts property URLs from the HTML of the current page using regex.
- Scroll to Bottom: `scroll_to_bottom()` scrolls to the bottom of the page, ensuring any lazy-loaded content is loaded.
- Next Page Button: `go_to_next_page()` waits for the “Next” button to appear and scrolls to the bottom before clicking it. If the button is clickable, it moves to the next page; if not, it stops the loop.
- Repeat: This process repeats for each page until the specified `num_pages` is reached or no “Next” button is found.
This pagination approach allows the code to move seamlessly through multiple pages, scraping each page’s data until reaching the end.
Handling Dynamic Content: Using Selenium Stealth to Scrape Detail Pages Loaded with JavaScript
After gathering the list of detail page URLs, the next step is to loop through each URL and extract the required information. Since these detail pages load dynamic content via JavaScript, Selenium Stealth ensures the page is fully loaded before we extract the data using regex to find and pull specific information from the HTML.
Here’s how you can handle this:
- Loop Through URLs: Iterate through each URL from your list.
- Load the Page: Use Selenium Stealth to fully load each detail page.
- Extract Data with Regex: Use regex to extract specific data such as pricing, reviews, or descriptions.
# Function to scrape information from a details page (title, price, etc.)
def scrape_details_page(url):
    try:
        driver.get(url)
        # Wait for the page to load (you can adjust this)
        time.sleep(2)
        html_content = driver.page_source
        scroll_to_bottom()
        time.sleep(2)

        # Regex pattern for scraping the title
        title_pattern = r'<h1[^>]+>([^<]+)</h1>'
        # Scrape the title (adjust the pattern according to the page structure)
        title = re.search(title_pattern, html_content)
        if title:
            title = title.group(1)
        else:
            title = None

        # Scrape the price
        price_pattern = r'(\$\d+[^<]+)</span></span>[^>]+></div></div>'
        price = re.search(price_pattern, html_content)
        if price:
            price = price.group(1)
        else:
            price = None

        # Scrape the address
        address_pattern = r'dir-ltr"><div[^>]+><section><div[^>]+ltr"><h2[^>]+>([^<]+)</h2>'
        address = re.search(address_pattern, html_content)
        if address:
            address = address.group(1)
        else:
            address = None

        # Scrape the guest count
        guest_pattern = r'<li class="l7n4lsf[^>]+>([^<]+)<span'
        guest = re.search(guest_pattern, html_content)
        if guest:
            guest = guest.group(1)
        else:
            guest = None

        # You can add more information to scrape (example: description, amenities, etc.)
        # Scrape the bedroom, bed, and bath details
        bed_bath_pattern = r'</span>(\d+[^<]+)'
        bed_bath = re.findall(bed_bath_pattern, html_content)
        bed_bath_details = []
        if bed_bath:
            for bed_bath_info in bed_bath:
                bed_bath_details.append(bed_bath_info.strip())

        # Scrape review ratings such as Cleanliness, Accuracy, Communication, etc.
        reviews_pattern = r'l1nqfsv9[^>]+>([^<]+)</div>[^>]+>(\d+[^<]+)</div>'
        reviews_details = re.findall(reviews_pattern, html_content)
        review_list = []
        if reviews_details:
            for review in reviews_details:
                attribute, rating = review  # Unpack the attribute and rating
                review_list.append(f'{attribute} {rating}')  # Combine into a readable format

        # Scrape the host name
        host_name_pattern = r't1gpcl1t atm_w4_16rzvi6 atm_9s_1o8liyq atm_gi_idpfg4 dir dir-ltr[^>]+>([^<]+)'
        host_name = re.search(host_name_pattern, html_content)
        if host_name:
            host_name = host_name.group(1)
        else:
            host_name = None

        # Scrape the total number of reviews
        total_review_pattern = r'pdp-reviews-[^>]+>[^>]+>(\d+[^<]+)</span>'
        total_review = re.search(total_review_pattern, html_content)
        if total_review:
            total_review = total_review.group(1)
        else:
            total_review = None

        # Scrape host info
        host_info_pattern = r'd1u64sg5[^"]+atm_67_1vlbu9m dir dir-ltr[^>]+><div><span[^>]+>([^<]+)'
        host_info = re.findall(host_info_pattern, html_content)
        host_info_list = []
        if host_info:
            for host_info_details in host_info:
                host_info_list.append(host_info_details)

        # Print the scraped information (for debugging purposes)
        print(f"Title: {title}\n Price: {price}\n Address: {address}\n Guest: {guest}\n "
              f"bed_bath_details: {bed_bath_details}\n Reviews: {review_list}\n "
              f"Host_name: {host_name}\n total_review: {total_review}\n Host Info: {host_info_list}\n")

        # Return the scraped information as a dictionary (or adjust based on your needs)
        return {
            "url": url,
            "Title": title,
            "Price": price,
            "Address": address,
            "Guest": guest,
            "Bed_Bath_Details": bed_bath_details,
            "Reviews": review_list,
            "Host_Name": host_name,
            "Total_Reviews": total_review,
            "Host_Info": host_info_list
        }
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None

# Scrape the details page for each URL stored in the url_list
for url in url_list:
    print(f"Scraping details from: {url}")
    data = scrape_details_page(url)
In this approach, we load each detail page using Selenium Stealth to ensure dynamic JavaScript content is fully loaded, and then use regex to extract specific data directly from the HTML content. This method is efficient and flexible for scraping pages with dynamically loaded content.
Saving the Data: Storing Scraped Data Using Pandas in CSV/Excel Formats
After scraping and extracting data from the detail pages, it’s time to store that data effectively. Pandas makes it easy to save the extracted information in popular formats like CSV or Excel, both widely used for data analysis and sharing.
This is how you can do it:
Step 1: Create a DataFrame. After fetching the data, place it in a pandas DataFrame.
Step 2: Save to CSV/Excel. Save the DataFrame as a CSV or Excel file with pandas functions.
Example:
import pandas as pd

# Function to save data to CSV using pandas
def save_to_csv(data, filename='airbnb_data.csv'):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")

scraped_data = []

# Scrape the details page for each URL stored in the url_list
for url in url_list:
    print(f"Scraping details from: {url}")
    data = scrape_details_page(url)
    if data:
        scraped_data.append(data)

# After scraping, save data to CSV
if scraped_data:
    save_to_csv(scraped_data)
else:
    print("No data to save.")
By using pandas, you can easily manage and store the scraped data in a structured format, making it simple to use for further analysis or to share with others.
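As a quick illustration of that follow-up analysis, here is a minimal sketch of loading the saved file back into pandas. The column names match the dictionary keys used in scrape_details_page.

import pandas as pd

df = pd.read_csv("airbnb_data.csv")
print(df.shape)            # how many listings and columns were captured
print(df["Title"].head())  # peek at a few listing titles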
Here is the full code:
from selenium import webdriver
from selenium_stealth import stealth
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import re
import pandas as pd

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

driver = webdriver.Chrome(options=options)

# Stealth setup to avoid detection
stealth(driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

# Function to scrape the current page and return all property URLs
def scrape_current_page():
    html_content = driver.page_source
    url_pattern = r'labelledby="[^"]+" href="(/rooms/\d+[^"]+)"'
    urls = re.findall(url_pattern, html_content)
    return urls

# Function to scroll to the bottom of the page
def scroll_to_bottom():
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # Give time for the page to load additional content

# Function to wait for the "Next" button and click it
def go_to_next_page():
    try:
        # Wait until the "Next" button is clickable
        next_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "a[aria-label='Next']"))
        )
        scroll_to_bottom()  # Scroll to the bottom of the page before clicking
        next_button.click()
        return True
    except Exception as e:
        print(f"Couldn't navigate to next page: {e}")
        return False

# Base URL
url = "https://www.airbnb.com/s/United-States/homes?flexible_trip_lengths%5B%5D=one_week&date_picker_type=flexible_dates&place_id=ChIJCzYy5IS16lQRQrfeQ5K5Oxw&refinement_paths%5B%5D=%2Fhomes&search_type=AUTOSUGGEST"
driver.get(url)

# Ask the user how many pages to scrape
num_pages = int(input("How many pages do you want to scrape? "))

url_list = []  # Storing all URLs in a Python list

# Scrape the specified number of pages
for page in range(num_pages):
    print(f"Scraping page {page + 1}...")

    # Scrape URLs from the current page
    urls = scrape_current_page()
    for url in urls:
        details_page_url = "https://www.airbnb.com" + url
        print(details_page_url)  # Print extracted URLs
        url_list.append(details_page_url)

    # Try to go to the next page
    if not go_to_next_page():
        break  # If there's no "Next" button or an error occurs, stop the loop

    # Wait for the next page to load
    time.sleep(3)

# After scraping is complete, print the total number of URLs
print(f"Total URLs scraped: {len(url_list)}")

# Function to scrape information from a details page (title, price, etc.)
def scrape_details_page(url):
    try:
        driver.get(url)
        # Wait for the page to load (you can adjust this)
        time.sleep(2)
        html_content = driver.page_source
        scroll_to_bottom()
        time.sleep(2)

        # Regex pattern for scraping the title
        title_pattern = r'<h1[^>]+>([^<]+)</h1>'
        # Scrape the title (adjust the pattern according to the page structure)
        title = re.search(title_pattern, html_content)
        if title:
            title = title.group(1)
        else:
            title = None

        # Scrape the price
        price_pattern = r'(\$\d+[^<]+)</span></span>[^>]+></div></div>'
        price = re.search(price_pattern, html_content)
        if price:
            price = price.group(1)
        else:
            price = None

        # Scrape the address
        address_pattern = r'dir-ltr"><div[^>]+><section><div[^>]+ltr"><h2[^>]+>([^<]+)</h2>'
        address = re.search(address_pattern, html_content)
        if address:
            address = address.group(1)
        else:
            address = None

        # Scrape the guest count
        guest_pattern = r'<li class="l7n4lsf[^>]+>([^<]+)<span'
        guest = re.search(guest_pattern, html_content)
        if guest:
            guest = guest.group(1)
        else:
            guest = None

        # You can add more information to scrape (example: description, amenities, etc.)
        # Scrape the bedroom, bed, and bath details
        bed_bath_pattern = r'</span>(\d+[^<]+)'
        bed_bath = re.findall(bed_bath_pattern, html_content)
        bed_bath_details = []
        if bed_bath:
            for bed_bath_info in bed_bath:
                bed_bath_details.append(bed_bath_info.strip())

        # Scrape review ratings such as Cleanliness, Accuracy, Communication, etc.
        reviews_pattern = r'l1nqfsv9[^>]+>([^<]+)</div>[^>]+>(\d+[^<]+)</div>'
        reviews_details = re.findall(reviews_pattern, html_content)
        review_list = []
        if reviews_details:
            for review in reviews_details:
                attribute, rating = review  # Unpack the attribute and rating
                review_list.append(f'{attribute} {rating}')  # Combine into a readable format

        # Scrape the host name
        host_name_pattern = r't1gpcl1t atm_w4_16rzvi6 atm_9s_1o8liyq atm_gi_idpfg4 dir dir-ltr[^>]+>([^<]+)'
        host_name = re.search(host_name_pattern, html_content)
        if host_name:
            host_name = host_name.group(1)
        else:
            host_name = None

        # Scrape the total number of reviews
        total_review_pattern = r'pdp-reviews-[^>]+>[^>]+>(\d+[^<]+)</span>'
        total_review = re.search(total_review_pattern, html_content)
        if total_review:
            total_review = total_review.group(1)
        else:
            total_review = None

        # Scrape host info
        host_info_pattern = r'd1u64sg5[^"]+atm_67_1vlbu9m dir dir-ltr[^>]+><div><span[^>]+>([^<]+)'
        host_info = re.findall(host_info_pattern, html_content)
        host_info_list = []
        if host_info:
            for host_info_details in host_info:
                host_info_list.append(host_info_details)

        # Print the scraped information (for debugging purposes)
        print(f"Title: {title}\n Price: {price}\n Address: {address}\n Guest: {guest}\n "
              f"bed_bath_details: {bed_bath_details}\n Reviews: {review_list}\n "
              f"Host_name: {host_name}\n total_review: {total_review}\n Host Info: {host_info_list}\n")

        # Return the scraped information as a dictionary (or adjust based on your needs)
        return {
            "url": url,
            "Title": title,
            "Price": price,
            "Address": address,
            "Guest": guest,
            "Bed_Bath_Details": bed_bath_details,
            "Reviews": review_list,
            "Host_Name": host_name,
            "Total_Reviews": total_review,
            "Host_Info": host_info_list
        }
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None

# Function to save data to CSV using pandas
def save_to_csv(data, filename='airbnb_data.csv'):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")

scraped_data = []

# Scrape the details page for each URL stored in the url_list
for url in url_list:
    print(f"Scraping details from: {url}")
    data = scrape_details_page(url)
    if data:
        scraped_data.append(data)

# After scraping, save data to CSV
if scraped_data:
    save_to_csv(scraped_data)
else:
    print("No data to save.")

# Close the browser
driver.quit()
Avoiding Blocks: Techniques to Prevent Getting Blocked
When scraping websites such as Airbnb, sending a high volume of traffic from a single IP address can get you blocked or flagged as a bot. Proxies help prevent this: they hide your true IP address, so requests appear to come from different places.
Why Use Proxies?
Without a proxy, you can get rate-limited or completely blocked for sending too many requests from the same IP address. Using a proxy that hides your IP helps you avoid this (see the simple sketch below).
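For a single, unauthenticated proxy, the setup can be as simple as the following sketch. The proxy address is a placeholder; authenticated or rotating proxies need the extension-based approach shown in the larger example later in this section.

from selenium import webdriver

# Route Chrome through one (unauthenticated) proxy.
# "203.0.113.10:8000" is a placeholder; substitute your provider's endpoint.
options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://203.0.113.10:8000")
driver = webdriver.Chrome(options=options)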
Why Use Rotating Proxies?
Preventing IP Bans: Rotating proxies change the IP address with each request (or every few requests), reducing the chance that your scraper is detected and banned.
Avoiding Captchas: When a website sees a lot of traffic from one IP address, it may start serving captchas. Rotating proxies spread requests across numerous IPs, helping to avoid captchas.
This tutorial uses rotating proxies from Rayobyte as an example, but you can use any other reliable proxy provider that supports rotation. Rotating proxies reduce the likelihood of captchas and IP bans, allowing you to scrape faster and for longer periods.
Example of using a proxy:
import pandas as pd
import re
import time
import random
from selenium import webdriver
from selenium_stealth import stealth
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Function to create proxy authentication extension
def create_proxy_auth_extension(proxy_host, proxy_user, proxy_pass):
    import zipfile
    import os

    # Separate the host and port
    host = proxy_host.split(':')[0]
    port = proxy_host.split(':')[1]

    # Define proxy extension files
    manifest_json = """
    {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "Chrome Proxy",
        "permissions": [
            "proxy",
            "tabs",
            "unlimitedStorage",
            "storage",
            "<all_urls>",
            "webRequest",
            "webRequestBlocking"
        ],
        "background": {
            "scripts": ["background.js"]
        },
        "minimum_chrome_version": "22.0.0"
    }
    """

    background_js = f"""
    var config = {{
        mode: "fixed_servers",
        rules: {{
            singleProxy: {{
                scheme: "http",
                host: "{host}",
                port: parseInt({port})
            }},
            bypassList: ["localhost"]
        }}
    }};

    chrome.proxy.settings.set({{value: config, scope: "regular"}}, function() {{}});

    chrome.webRequest.onAuthRequired.addListener(
        function(details) {{
            return {{
                authCredentials: {{
                    username: "{proxy_user}",
                    password: "{proxy_pass}"
                }}
            }};
        }},
        {{urls: ["<all_urls>"]}},
        ["blocking"]
    );
    """

    # Create the extension
    pluginfile = 'proxy_auth_plugin.zip'
    with zipfile.ZipFile(pluginfile, 'w') as zp:
        zp.writestr("manifest.json", manifest_json)
        zp.writestr("background.js", background_js)
    return pluginfile

# Function to configure and return the WebDriver with proxy
def init_driver_with_proxy(proxy_server, proxy_username, proxy_password):
    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")

    # Add proxy authentication if necessary
    if proxy_username and proxy_password:
        options.add_extension(create_proxy_auth_extension(proxy_server, proxy_username, proxy_password))

    # Stealth mode to avoid detection
    driver = webdriver.Chrome(options=options)
    stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
    )
    return driver

# Proxy pool for rotation (list of proxy servers)
proxy_pool = [
    {"proxy": "proxy1.com:8000", "username": "user1", "password": "pass1"},
    {"proxy": "proxy2.com:8000", "username": "user2", "password": "pass2"},
    {"proxy": "proxy3.com:8000", "username": "user3", "password": "pass3"}
]

# Function to scrape details page (rotate proxy on each request)
def scrape_details_page(url):
    try:
        # Rotate proxy by choosing a random one from the pool
        proxy = random.choice(proxy_pool)
        driver = init_driver_with_proxy(proxy['proxy'], proxy['username'], proxy['password'])

        driver.get(url)
        time.sleep(3)  # Wait for the page to load
        html_content = driver.page_source

        # Regex pattern for scraping the title
        title_pattern = r'<h1[^>]+>([^<]+)</h1>'
        # Scrape the title
        title = re.search(title_pattern, html_content)
        if title:
            title = title.group(1)
        else:
            title = None

        # Scrape the price
        price_pattern = r'(\$\d+[^<]+)</span></span>[^>]+></div></div>'
        price = re.search(price_pattern, html_content)
        if price:
            price = price.group(1)
        else:
            price = None

        # Scrape the address
        address_pattern = r'dir-ltr"><div[^>]+><section><div[^>]+ltr"><h2[^>]+>([^<]+)</h2>'
        address = re.search(address_pattern, html_content)
        if address:
            address = address.group(1)
        else:
            address = None

        # Scrape the guest count
        guest_pattern = r'<li class="l7n4lsf[^>]+>([^<]+)<span'
        guest = re.search(guest_pattern, html_content)
        if guest:
            guest = guest.group(1)
        else:
            guest = None

        # You can add more information to scrape (example: description, amenities, etc.)
        # Scrape the bedroom, bed, and bath details
        bed_bath_pattern = r'</span>(\d+[^<]+)'
        bed_bath = re.findall(bed_bath_pattern, html_content)
        bed_bath_details = []
        if bed_bath:
            for bed_bath_info in bed_bath:
                bed_bath_details.append(bed_bath_info.strip())

        # Scrape review ratings such as Cleanliness, Accuracy, Communication, etc.
        reviews_pattern = r'l1nqfsv9[^>]+>([^<]+)</div>[^>]+>(\d+[^<]+)</div>'
        reviews_details = re.findall(reviews_pattern, html_content)
        review_list = []
        if reviews_details:
            for review in reviews_details:
                attribute, rating = review  # Unpack the attribute and rating
                review_list.append(f'{attribute} {rating}')  # Combine into a readable format

        # Scrape the host name
        host_name_pattern = r't1gpcl1t atm_w4_16rzvi6 atm_9s_1o8liyq atm_gi_idpfg4 dir dir-ltr[^>]+>([^<]+)'
        host_name = re.search(host_name_pattern, html_content)
        if host_name:
            host_name = host_name.group(1)
        else:
            host_name = None

        # Scrape the total number of reviews
        total_review_pattern = r'pdp-reviews-[^>]+>[^>]+>(\d+[^<]+)</span>'
        total_review = re.search(total_review_pattern, html_content)
        if total_review:
            total_review = total_review.group(1)
        else:
            total_review = None

        # Scrape host info
        host_info_pattern = r'd1u64sg5[^"]+atm_67_1vlbu9m dir dir-ltr[^>]+><div><span[^>]+>([^<]+)'
        host_info = re.findall(host_info_pattern, html_content)
        host_info_list = []
        if host_info:
            for host_info_details in host_info:
                host_info_list.append(host_info_details)

        # Print the scraped information (for debugging purposes)
        print(f"Title: {title}\n Price: {price}\n Address: {address}\n Guest: {guest}\n "
              f"bed_bath_details: {bed_bath_details}\n Reviews: {review_list}\n "
              f"Host_name: {host_name}\n total_review: {total_review}\n Host Info: {host_info_list}\n")

        # Store the scraped information in a dictionary
        result = {
            "url": url,
            "Title": title,
            "Price": price,
            "Address": address,
            "Guest": guest,
            "Bed_Bath_Details": bed_bath_details,
            "Reviews": review_list,
            "Host_Name": host_name,
            "Total_Reviews": total_review,
            "Host_Info": host_info_list
        }

        driver.quit()  # Close this proxy session's browser before returning
        return result
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None

# Function to save data to CSV using pandas
def save_to_csv(data, filename='airbnb_data.csv'):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")

# List of URLs to scrape
url_list = ["https://www.airbnb.com/rooms/968367851365040114?adults=1&category_tag=Tag%3A8148&children=0&enable_m3_private_room=true&infants=0&pets=0&photo_id=1750644422&search_mode=regular_search&check_in=2025-01-18&check_out=2025-01-23&source_impression_id=p3_1729605408_P3X7GT0Ec98R7_ET&previous_page_section_name=1000&federated_search_id=62850efb-a8ab-4062-92ec-e9010fc6a24f"]  # Replace with actual URLs

scraped_data = []

# Scrape the details page for each URL with proxy rotation
for url in url_list:
    print(f"Scraping details from: {url}")
    data = scrape_details_page(url)
    if data:
        scraped_data.append(data)

# After scraping, save data to CSV
if scraped_data:
    save_to_csv(scraped_data)
else:
    print("No data to save.")
Legal and Ethical Issues
To prevent potential problems, you must scrape data legally and ethically. In every case, review the website’s Terms of Service before collecting data.
Avoid overloading the website’s servers with many requests in a short time. Scrape in moderation and add delays between requests to reduce the risk of disruption (a small sketch follows). Scraping should never degrade a website’s performance or capture proprietary or sensitive data.
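One simple way to honor this is a randomized delay between requests. The sketch below reuses url_list and scrape_details_page from the earlier examples; the 3-7 second range is an arbitrary example, so tune it conservatively for your use case.

import random
import time

scraped_data = []
for url in url_list:
    data = scrape_details_page(url)
    if data:
        scraped_data.append(data)
    time.sleep(random.uniform(3, 7))  # randomized pause between requests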