Welcome to Rayobyte University! Today, we're diving into the world of web scraping, focusing specifically on how to scrape data from Google Maps. This comprehensive guide walks you through the entire process using the industry-standard tools Scrapy and Playwright, complete with code examples that you can implement directly in your projects.
The example below shows the scraper for universities in Nebraska that Fabien Vauchelles made for our ever-unhappy Dean. Watch the embedded video to see the scraper being written and explained in real time, or simply read on for the final result.
Scraping Google Maps allows you to collect detailed information for various purposes, such as competitor analysis, market research, or lead generation. Automating this data collection not only saves time but also ensures accuracy, enabling you to focus on analyzing data rather than manually gathering it.
Before starting, it's crucial to set up your Python environment with the necessary tools:
pip install scrapy scrapy-playwright
playwright install --with-deps
Next, create a new Scrapy project:
scrapy startproject googlemaps
We will create a spider that searches for universities in Nebraska on Google Maps and scrapes relevant data such as names, ratings, and phone numbers.
Move into the project directory and generate the spider skeleton:
cd googlemaps
scrapy genspider university www.google.com
Update items.py with:

from dataclasses import dataclass, field

from itemloaders.processors import TakeFirst, MapCompose
from scrapy.loader import ItemLoader


@dataclass
class UniversityItem:
    """Container for one scraped university."""
    name: str = field(default=None)
    rating: float = field(default=None)
    phone: str = field(default=None)


class UniversityItemLoader(ItemLoader):
    default_item_class = UniversityItem
    default_input_processor = MapCompose(str.strip)  # trim whitespace from every value
    default_output_processor = TakeFirst()  # keep the first extracted value per field
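To see what the processors do, you can exercise the loader in a Python shell. The input values here are made up purely for illustration:

# Illustrative values only: MapCompose(str.strip) trims each input,
# and TakeFirst() keeps the first value collected per field.
loader = UniversityItemLoader()
loader.add_value("name", "  University of Nebraska  ")
loader.add_value("name", "a second value that TakeFirst ignores")
print(loader.load_item())
# UniversityItem(name='University of Nebraska', rating=None, phone=None)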
Then replace the contents of spiders/university.py with the following code:

from googlemaps.items import UniversityItemLoader
from scrapy import Spider, Request
from scrapy_playwright.page import PageMethod


class UniversitySpider(Spider):
    name = "university"

    def start_requests(self):
        # Route the search page through Playwright so the consent
        # form can be dismissed before parsing begins.
        yield Request(
            url="https://www.google.com/maps/search/university+in+nebraska+United+States?hl=en-US",
            callback=self.parse_universities,
            meta=dict(
                playwright=True,
                playwright_page_methods=[
                    # Wait for the consent form, then accept it.
                    PageMethod("wait_for_selector", selector="form"),
                    PageMethod("click", selector="button"),
                ],
            ),
        )

    def parse_universities(self, response):
        # Each entry in the results feed links to a place page.
        links = response.css('div[role="feed"] > div > div > a')
        for link in links:
            yield response.follow(
                url=link,
                callback=self.parse_university,
                meta={"playwright": True},
            )

    def parse_university(self, response):
        item = UniversityItemLoader(response=response)
        item.add_css("name", "h1::text")
        item.add_xpath("rating", ".//*[contains(@aria-label, 'stars')]/@aria-label")
        item.add_xpath("phone", '//button[contains(@aria-label, "Phone:")]/@aria-label')
        yield item.load_item()
This code handles the entire scraping process, including dealing with Google Maps' consent form via Playwright and extracting relevant data using CSS and XPath selectors.
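One caveat: Google Maps only renders the first batch of results until the feed is scrolled. If you need more listings, one approach is to extend the same playwright_page_methods list with scrolling steps. This is a sketch, and the div[role="feed"] selector and the fixed two-second wait are assumptions you may need to tune:

playwright_page_methods=[
    PageMethod("wait_for_selector", selector="form"),
    PageMethod("click", selector="button"),
    # Assumed: results live in div[role="feed"]; scrolling it to the
    # bottom triggers lazy loading of further entries.
    PageMethod("wait_for_selector", selector='div[role="feed"]'),
    PageMethod(
        "evaluate",
        """() => {
            const feed = document.querySelector('div[role="feed"]');
            if (feed) feed.scrollTo(0, feed.scrollHeight);
        }""",
    ),
    PageMethod("wait_for_timeout", 2000),  # give new results time to render
]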
Before running the spider, configure Scrapy to work seamlessly with Playwright:
Replace the contents of settings.py with:

BOT_NAME = "googlemaps"

SPIDER_MODULES = ["googlemaps.spiders"]
NEWSPIDER_MODULE = "googlemaps.spiders"

# Route every request through Playwright instead of Scrapy's default downloader.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 1  # one browser page at a time keeps the crawl gentle
COOKIES_ENABLED = True  # the consent choice is remembered via cookies

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
# scrapy-playwright requires the asyncio reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": False,  # set to True for production
}

# Let the browser send its own headers rather than Scrapy's.
PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None
This setup ensures that Scrapy uses Playwright for handling requests and that all requests appear to be coming from a real browser.
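Because Playwright drives a real browser, you can also capture what a page actually looked like when debugging selector issues. Here is a minimal sketch of a drop-in start_requests for the spider; the screenshot step and the file name maps_debug.png are my additions, not part of the original scraper:

from scrapy import Request
from scrapy_playwright.page import PageMethod

def start_requests(self):
    yield Request(
        url="https://www.google.com/maps/search/university+in+nebraska+United+States?hl=en-US",
        callback=self.parse_universities,
        meta=dict(
            playwright=True,
            playwright_page_methods=[
                # Save the rendered page to disk for inspection
                # (the file name is arbitrary).
                PageMethod("screenshot", path="maps_debug.png", full_page=True),
            ],
        ),
    )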
Run the spider from the project root:
scrapy crawl university
To save the results to a file, pass an output feed, for example CSV:
scrapy crawl university -o universities.csv
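If you would rather launch the crawl from a script (for scheduling, say), Scrapy's CrawlerProcess works too. A minimal sketch; run it from the project root so settings.py is picked up:

# run.py - start the spider without the scrapy CLI
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from googlemaps.spiders.university import UniversitySpider

process = CrawlerProcess(get_project_settings())  # loads settings.py
process.crawl(UniversitySpider)
process.start()  # blocks until the crawl finishes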
To improve the quality of the scraped data, especially for fields like ratings and phone numbers, we'll add custom formatting functions:
Update items.py with parsing functions:

import re


def parse_rating(x):
    # Pull the numeric rating out of text such as "4.5 stars".
    try:
        return float(re.search(r"(\d+\.\d+)", x).group(1))
    except AttributeError:  # no match found
        return None


def parse_phone(x):
    # Drop the "Phone:" prefix from the aria-label; split only on the
    # first colon so the number itself is never truncated.
    return x.split(":", 1)[1].strip()


class UniversityItemLoader(ItemLoader):
    default_item_class = UniversityItem
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()

    rating_in = MapCompose(parse_rating)
    phone_in = MapCompose(parse_phone)
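A quick check of the two helpers; the aria-label strings below are illustrative of what Google Maps typically exposes, and the real labels may differ slightly:

# Illustrative inputs, not scraped data.
print(parse_rating("4.5 stars 1,200 Reviews"))  # 4.5
print(parse_rating("No rating available"))      # None
print(parse_phone("Phone: +1 402-555-0123"))    # +1 402-555-0123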
With these enhancements, your spider will now output clean and structured data ready for analysis.
By following this guide, you're now equipped to scrape Google Maps data efficiently and at scale. With Scrapy and Playwright working together, you can bypass many of the challenges posed by dynamic content and anti-scraping measures, making this a powerful tool for your data collection needs.