Scraping Google Maps: A Complete Guide

Welcome to Rayobyte University! Today, we're diving into the world of web scraping, focusing specifically on how to scrape data from Google Maps. This comprehensive guide walks you through the entire process using the industry-standard tools Scrapy and Playwright, complete with code examples you can drop directly into your projects.

The example below shows the scraper for universities in Nebraska that Fabien Vauchelles built for our ever-unhappy Dean. Watch the embedded video to see the scraper written and explained in real time, or simply read on for the final result.

Why Scrape Google Maps?

Scraping Google Maps allows you to collect detailed place data, such as business names, ratings, and phone numbers, for purposes like competitor analysis, market research, or lead generation. Automating this collection not only saves time but also reduces transcription errors, letting you focus on analyzing the data rather than gathering it by hand.

Step 1: Setting Up Your Environment

Before starting, it's crucial to set up your Python environment with the necessary tools:

  1. Install Scrapy and Playwright:
pip install scrapy scrapy-playwright
  2. Install Playwright with Dependencies:
playwright install --with-deps
  3. Create a Scrapy Project:
scrapy startproject googlemaps
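
To confirm everything installed correctly before moving on, a quick import check helps. This is just a minimal sanity-check sketch, not part of the project itself:

# A minimal check that the tooling is importable.
import scrapy
import scrapy_playwright  # raises ImportError if the plugin is missing

print("Scrapy", scrapy.__version__, "and scrapy-playwright are ready")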

Step 2: Writing the Spider

We will create a spider that searches for universities in Nebraska on Google Maps and scrapes relevant data such as names, ratings, and phone numbers.

  1. Create a Spider:
cd googlemaps
scrapy genspider university www.google.com
  2. Define the Data Structure: Replace the content of items.py with:
from dataclasses import dataclass, field
from typing import Optional

from itemloaders.processors import TakeFirst, MapCompose
from scrapy.loader import ItemLoader

@dataclass
class UniversityItem:
    name: Optional[str] = field(default=None)
    rating: Optional[float] = field(default=None)
    phone: Optional[str] = field(default=None)

class UniversityItemLoader(ItemLoader):
    # Strip surrounding whitespace from every extracted value...
    default_input_processor = MapCompose(str.strip)
    # ...and keep only the first non-empty match per field.
    default_output_processor = TakeFirst()
    default_item_class = UniversityItem
  3. Implement the Spider Logic: Update spiders/university.py with the following code:
from googlemaps.items import UniversityItemLoader
from scrapy import Spider, Request
from scrapy_playwright.page import PageMethod

class UniversitySpider(Spider):
    name = "university"

    def start_requests(self):
        yield Request(
            url="https://www.google.com/maps/search/university+in+nebraska+United+States?hl=en-US",
            callback=self.parse_universities,
            meta=dict(
                playwright=True,
                playwright_page_methods=[
                    # Wait for Google's consent form, then click its button
                    # to accept so the results page can load.
                    PageMethod("wait_for_selector", selector="form"),
                    PageMethod("click", selector="button"),
                ]
            )
        )

    def parse_universities(self, response):
        # Each entry in the left-hand results feed is an <a> element
        # pointing at a place's detail page.
        links = response.css('div[role="feed"] > div > div > a')
        for link in links:
            yield response.follow(
                url=link,
                callback=self.parse_university,
                meta={"playwright": True}
            )

    def parse_university(self, response):
        item = UniversityItemLoader(response=response)
        item.add_css('name', 'h1::text')
        # The rating and phone number are exposed through aria-label attributes.
        item.add_xpath('rating', "//*[contains(@aria-label, 'stars')]/@aria-label")
        item.add_xpath('phone', '//button[contains(@aria-label, "Phone:")]/@aria-label')
        yield item.load_item()

This code handles the entire scraping process, including dealing with Google Maps' consent form via Playwright and extracting relevant data using CSS and XPath selectors.
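
To see what those selectors actually match, you can exercise them in isolation with Scrapy's Selector. The HTML fragment below is invented for illustration; real Google Maps markup is far more complex, and the sample rating and phone number are made up:

from scrapy.selector import Selector

# A made-up fragment shaped like the attributes the spider targets.
html = """
<h1>University of Nebraska-Lincoln</h1>
<span aria-label="4.6 stars 1,234 Reviews"></span>
<button aria-label="Phone: +1 402-555-0123"></button>
"""

sel = Selector(text=html)
print(sel.css('h1::text').get())                                                  # University of Nebraska-Lincoln
print(sel.xpath("//*[contains(@aria-label, 'stars')]/@aria-label").get())         # 4.6 stars 1,234 Reviews
print(sel.xpath('//button[contains(@aria-label, "Phone:")]/@aria-label').get())   # Phone: +1 402-555-0123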

Step 3: Configuring Scrapy

Before running the spider, configure Scrapy to work seamlessly with Playwright:

  1. Update settings.py:
BOT_NAME = "googlemaps"

SPIDER_MODULES = ["googlemaps.spiders"]
NEWSPIDER_MODULE = "googlemaps.spiders"

# Route all HTTP and HTTPS requests through scrapy-playwright so every
# page is rendered by a real browser.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

ROBOTSTXT_OBEY = False

# Crawl gently and keep session state: one request at a time, cookies on.
CONCURRENT_REQUESTS = 1
COOKIES_ENABLED = True

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": False  # Set to True for production
}

# Let the browser send its own headers instead of Scrapy's defaults.
PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None

This setup ensures that Scrapy hands every request to Playwright, so pages are fetched by a real browser and carry browser-native headers.

Step 4: Running the Spider

  1. Run the Spider:
scrapy crawl university
  2. Export Results to CSV:
scrapy crawl university -o universities.csv
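
If you prefer to drive the crawl from Python instead of the command line, Scrapy's CrawlerProcess can do it. The sketch below assumes it is saved as a script (the name run.py is just an example) inside the project directory so that get_project_settings() can find settings.py:

# run.py -- a minimal sketch for running the spider programmatically.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from googlemaps.spiders.university import UniversitySpider

process = CrawlerProcess(get_project_settings())  # loads the settings.py above
process.crawl(UniversitySpider)
process.start()  # blocks until the crawl finishes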

Step 5: Post-Processing and Data Formatting

To improve the quality of the scraped data, especially for fields like ratings and phone numbers, we'll add custom formatting functions:

  1. Update items.py with Parsing Functions:
import re

def parse_rating(x):
    # Pull the numeric rating out of a label such as "4.6 stars ...".
    try:
        return float(re.search(r"(\d+\.\d+)", x).group(1))
    except (AttributeError, ValueError):
        return None

def parse_phone(x):
    # Drop the "Phone:" prefix and keep the number itself.
    return x.split(":", 1)[1].strip()

class UniversityItemLoader(ItemLoader):
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()
    default_item_class = UniversityItem

    # Field-specific input processors; these replace the default
    # str.strip processor for the rating and phone fields.
    rating_in = MapCompose(parse_rating)
    phone_in = MapCompose(parse_phone)

With these enhancements, your spider will now output clean and structured data ready for analysis.
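
As a quick check of the new parsers, here is a standalone snippet run against invented labels shaped like the aria-labels the spider extracts:

from googlemaps.items import parse_rating, parse_phone

print(parse_rating("4.6 stars 1,234 Reviews"))  # 4.6
print(parse_rating("no numeric rating"))        # None
print(parse_phone("Phone: +1 402-555-0123"))    # +1 402-555-0123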

By following this guide, you're now equipped to scrape Google Maps data efficiently and at scale. With Scrapy and Playwright working together, you can bypass many of the challenges posed by dynamic content and anti-scraping measures, making this a powerful tool for your data collection needs.
