Scrape Google My Business Data Using Python: A Step-by-Step Guide
Table of Contents
- Introduction
- Prerequisites
- Step-by-Step Guide
- Important Notes
- Conclusion
Google My Business is a crucial tool for businesses to manage their online presence. In this tutorial, we’ll show you how to build a Google My Business scraper using Python. You’ll learn how to extract valuable business information such as business names, reviews, ratings, contact details, and more. This tool will help you gather insights and manage your business’s online reputation effectively.
What You’ll Need:
- Python 3.x
- Playwright library for web scraping
- CSV for storing the scraped data
- Proxy support for anonymity
- Basic understanding of HTML and CSS selectors
Introduction
Google My Business is an essential platform that helps businesses appear in local search results on Google, including Google Maps. With millions of businesses listing their details online, scraping data from Google My Business can provide valuable insights into local markets, business performance, and competition.
In this step-by-step guide, we will walk you through creating a scraper using Python and the Playwright library to extract business data, including:
- Business Name
- Address
- Phone Number
- Website
- Ratings & Reviews
The Python code provided will allow you to scrape Google My Business data directly from Google Search results, store it in a CSV file, and use a proxy to enhance your scraping process and avoid getting blocked by Google.
Prerequisites
Before we dive into the code, make sure you have Python 3.x installed on your computer. You will also need to install the Playwright library, which is a powerful web automation tool for Python.
Run the following command to install Playwright:
pip install playwright
python -m playwright install
You will also need a proxy service to hide your identity while scraping. This will prevent your IP from being blocked by Google. If you don't have one, consider using a paid proxy service such as Rayobyte.
Here’s how you can set up the proxy in your script.
Step-by-Step Guide
Step 1: Import Libraries
We will use the sync_playwright function from the Playwright library. This will allow us to interact with web pages as if we were using a browser. Additionally, we'll import Python's built-in csv module to save the scraped data.
from playwright.sync_api import sync_playwright
import csv
Step 2: Define the Scrape Function
The scrape_page() function is designed to scrape specific information from a Google My Business listing, such as:
- Business Name
- Address
- Phone Number
- Website
- Ratings and Reviews
Here’s how it works:
def scrape_page(page, writer):
    # Each result in the left-hand panel uses the .rllt__details container
    all_business = page.query_selector_all(".rllt__details")
    for business in all_business:
        business.click()
        page.wait_for_timeout(2000)  # Wait 2 seconds for the detail panel to load

        # query_selector returns None when nothing matches, so each field
        # falls back to "not found" instead of raising an error
        business_name = page.query_selector(".SPZz6b")
        business_name = business_name.text_content() if business_name else "not found"

        business_address = page.query_selector(".LrzXr")
        business_address = business_address.text_content() if business_address else "not found"

        business_phone_number = page.query_selector(".LrzXr.zdqRlf.kno-fv")
        business_phone_number = business_phone_number.text_content() if business_phone_number else "not found"

        business_website = page.query_selector(".xFAlBc")
        business_website = business_website.text_content() if business_website else "not found"

        rating_reviews = page.query_selector(".TLYLSe.MaBy9")
        rating_reviews = rating_reviews.text_content() if rating_reviews else "not found"

        # Store the scraped data as one CSV row per business
        writer.writerow([business_name, business_address, business_phone_number, business_website, rating_reviews])
        print(f"Data saved: {business_name}, {business_address}, {business_phone_number}, {business_website}, {rating_reviews}\n")
Step 3: Main Function to Scrape Data and Use Proxy
The main() function will use Playwright to navigate through the pages, scrape the data, and store it in a CSV file. It also includes proxy support to help hide your identity during the scraping process.
def main():
    with sync_playwright() as p:
        # Set up the browser and the proxy-enabled context
        browser = p.chromium.launch(headless=False, slow_mo=50)
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            device_scale_factor=1,
            proxy={
                "server": "",    # Replace with your proxy server address and port
                "username": "",  # Replace with your proxy username (if required)
                "password": ""   # Replace with your proxy password (if required)
            }
        )
        page = context.new_page()

        url = input("Give URL and press enter: ").strip()
        page.goto(url)

        # Open the CSV file that will store the scraped data
        with open('google_my_business_data.csv', mode='w', newline='', encoding='utf-8') as file:
            writer = csv.writer(file)
            writer.writerow(["Business Name", "Business Address", "Phone Number", "Website", "Ratings & Reviews"])

            while True:
                page.wait_for_timeout(1000)  # Wait for 1 second
                scrape_page(page, writer)
                try:
                    # Check for and click the next-page button
                    next_page = page.query_selector(".oeN89d")
                    if next_page:
                        next_page.click()
                        page.wait_for_timeout(2000)  # Wait for 2 seconds
                    else:
                        print("No more pages.")
                        break
                except Exception as e:
                    print("Error navigating to next page:", e)
                    break

        browser.close()

if __name__ == "__main__":
    main()
[Screenshot: sample rows from the generated google_my_business_data.csv]
Step 4: Running the Script
Once the script is ready, save it as a Python file (e.g., google_business_scrape.py) and run it. The script will prompt you for a Google My Business URL, scrape the listings, and store the information in a CSV file. You can easily modify the script to handle more complex tasks or scrape more details.
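For example, a typical run from the terminal might look like this (the search URL below is just an illustration):

python google_business_scrape.py
Give URL and press enter: https://www.google.com/search?q=restaurants+near+me
Data saved: ..., ..., ..., ..., ...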
Important Notes
1. Google’s Continuous HTML Updates
Google frequently updates the structure of its HTML pages. This means that the CSS selectors used in the scraper may not always work. If the script stops working or throws errors, you may need to update the CSS selectors in the script to match the new structure. Here are some things to check:
- Element Class Names: These may change over time. The script uses class names like .rllt__details or .LrzXr. If Google changes these, the script won't be able to find the data.
- Element Structure: The order or position of certain elements on the page may change, requiring updates to the scraper.
To fix these issues, inspect the page elements using a browser’s developer tools (F12) to find the new CSS selectors and update the script accordingly.
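One way to make such updates less painful is to keep every selector in a single dictionary near the top of the script, so a markup change only has to be fixed in one place. Here is a small sketch of that idea (the get_text helper is illustrative, not part of the script above):

# All CSS selectors in one place; update here when Google changes its markup
SELECTORS = {
    "listing": ".rllt__details",
    "name": ".SPZz6b",
    "address": ".LrzXr",
    "phone": ".LrzXr.zdqRlf.kno-fv",
    "website": ".xFAlBc",
    "rating": ".TLYLSe.MaBy9",
}

def get_text(page, key):
    # Returns the element's text, or "not found" if the selector no longer matches
    element = page.query_selector(SELECTORS[key])
    return element.text_content() if element else "not found"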
2. Legal Considerations
Scraping Google My Business data may violate Google’s terms of service. Always ensure that you are scraping data in accordance with the relevant legal guidelines and the site’s terms.
3. Proxy Usage
Using proxies is important to avoid being blocked by Google while scraping. You can use a proxy service to change your IP address for each request, thus ensuring anonymity. Here’s an example of how to configure the proxy in Playwright:
context = browser.new_context(
    proxy={
        "server": "server_name:port",
        "username": "username",
        "password": "password"
    }
)
Make sure to replace server_name:port, username, and password with your actual proxy details. Most proxy services will provide these details when you sign up or subscribe to their services.
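To confirm the proxy is actually in use before running the full scraper, you can open an IP-echo page through the configured context and check that the reported address belongs to the proxy (httpbin.org/ip is used here purely as an example endpoint):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        proxy={"server": "server_name:port", "username": "username", "password": "password"}
    )
    page = context.new_page()
    page.goto("https://httpbin.org/ip")
    print(page.text_content("body"))  # should show the proxy's IP, not yours
    browser.close()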
Conclusion
In this guide, we showed how to build a Google My Business scraper using Python and Playwright. This script extracts business information like name, address, phone number, website, and ratings, and stores it in a CSV file. Additionally, we integrated proxy support to help prevent blocking during scraping.
Remember that Google frequently updates its HTML structure, so keep your CSS selectors up to date. Always respect legal guidelines and Google’s terms of service when scraping data from their platform.
Happy scraping!