Create a Flight Price Tracker: Scraping Airlines Ticket Prices from Google Flights using Python
Source code: google_flight_scraper
Table of Contents
Introduction
Ethical Considerations
Data that we want to scrape
Prerequisites
Project Setup
Why Playwright?
Setting up Browser Automation
Understanding Google Flights URL Structure
Scraping the flight data
Saving to CSV
The complete code
Setting Up Proxy Rotation
Conclusion
Disclaimer
Introduction
Google Flights aggregates data from various airlines and travel companies, providing travelers with comprehensive information about available flights, pricing, and schedules. This allows travelers to compare airline prices, assess flight durations, and monitor environmental impacts, ultimately helping them secure the best travel deals.
In this tutorial, I will guide you through the process of scraping essential flight data from Google Flights using Python and Playwright. You will learn how to extract valuable information such as departure and arrival times, flight durations, prices, and more—all while ensuring that your scraping methods are effective and efficient.
This information is not only valuable for individual travelers but also for businesses. Companies can leverage flight data to conduct competitor analysis, understand customer preferences, and make informed decisions about pricing and marketing strategies. By scraping data from Google Flights, businesses can gain insights into market trends and optimize their offerings to better meet the needs of their customers.
Ethical Considerations
Web scraping involves legal and ethical responsibilities:
- Respect website terms of service
- Avoid overwhelming server resources
- Use scraping for research and personal purposes
- Implement rate limiting and proxy rotation
- Ensure data is not used for commercial exploitation without permission
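To illustrate the rate-limiting point above, here is a minimal sketch (the helper name and delay bounds are my own, not part of the project) that spaces out scraping tasks with a randomized pause between requests:

```python
import asyncio
import random

async def run_rate_limited(coros, min_delay=2.0, max_delay=5.0):
    """Await scraping coroutines one at a time, sleeping a random
    interval between them so requests are not fired in a burst."""
    results = []
    for coro in coros:
        results.append(await coro)
        await asyncio.sleep(random.uniform(min_delay, max_delay))
    return results
```

In the scraper later in this article, each page navigation could be wrapped this way; the 2-5 second window is only an example.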
Data that we want to scrape
We will collect the following flight information:
- Departure times
- Arrival times
- Airline name
- Flight duration
- Number of stops
- Price
- CO2 emissions
- Emissions comparison with typical flights
Prerequisites
- Python 3.7+
- Basic Python knowledge
- Required package: playwright (asyncio is part of the Python standard library, so it does not need to be installed separately)

```shell
pip install playwright
playwright install  # Install browser binaries
```
Why Playwright?
Playwright is a modern automation framework that makes browser automation straightforward. It supports multiple browsers and offers robust features for handling dynamic websites, including waiting for elements to load and intercepting network requests.
Its asynchronous capabilities allow for efficient handling of multiple tasks, making it suitable for web scraping where speed and performance are crucial.
Setting up Browser Automation
```python
async def setup_browser():
    p = await async_playwright().start()
    browser = await p.chromium.launch(headless=False)  # Set to True in production
    page = await browser.new_page()
    return p, browser, page
```
This function initializes the Playwright browser, allowing for web scraping of flight data. The headless parameter can be toggled for visibility during development.
Understanding Google Flights URL Structure
One of the trickiest parts of scraping Google Flights is constructing the correct URLs. Google Flights encodes its search parameters as a compact base64 string.
Once we click the “Search” button, we will notice something like this appear in the URL: “CBwQAhoeEgoyMDI0LTEyLTI1agcIARIDU0ZPcgcIARIDTEFYQAFIAXABggELCP___________wGYAQI”

Let’s decode this parameter using base64:
```python
import base64

encoded = "CBwQAhoeEgoyMDI0LTEyLTI1agcIARIDU0ZPcgcIARIDTEFYQAFIAXABggELCP___________wGYAQI"
decoded = base64.urlsafe_b64decode(encoded + "==")
print(decoded)
```
From the result, we can confirm that the flight is on 25 December 2024, departing from San Francisco (SFO) to Los Angeles (LAX).
To incorporate this into our scraper, we need to reverse the decoding, building the encoded parameter ourselves. Let’s handle this by creating a FlightURLBuilder class:

```python
class FlightURLBuilder:
```
Creating Binary Data
```python
@staticmethod
def _create_one_way_bytes(departure: str, destination: str, date: str) -> bytes:
    return (
        b'\x08\x1c\x10\x02\x1a\x1e\x12\n' + date.encode() +
        b'j\x07\x08\x01\x12\x03' + departure.encode() +
        b'r\x07\x08\x01\x12\x03' + destination.encode() +
        b'@\x01H\x01p\x01\x82\x01\x0b\x08\xfc\x06`\x04\x08'
    )
```
This code generates a bytes object that encodes the flight details (departure, destination, and date).
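As a sanity check, encoding these bytes (assuming the escape sequences are reconstructed correctly from the article) should reproduce the base64 string we decoded earlier, before the underscores are inserted:

```python
import base64

# Payload for SFO -> LAX on 2024-12-25, built the same way as _create_one_way_bytes
flight_bytes = (
    b'\x08\x1c\x10\x02\x1a\x1e\x12\n' + b'2024-12-25' +
    b'j\x07\x08\x01\x12\x03' + b'SFO' +
    b'r\x07\x08\x01\x12\x03' + b'LAX' +
    b'@\x01H\x01p\x01\x82\x01\x0b\x08\xfc\x06`\x04\x08'
)
encoded = base64.b64encode(flight_bytes).decode('utf-8')
print(encoded)
```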
Modifying a Base64 String
Encoding these bytes produces the following result:

“CBwQAhoeEgoyMDI0LTEyLTI1agcIARIDU0ZPcgcIARIDTEFYQAFIAXABggELCPwGYAQI”

To match the format Google generates, the string needs 7 underscores inserted before the last 6 characters:
```python
@staticmethod
def _modify_base64(encoded_str: str) -> str:
    insert_index = len(encoded_str) - 6
    return encoded_str[:insert_index] + '_' * 7 + encoded_str[insert_index:]
```
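A quick standalone check of the insertion logic (renamed here so it runs outside the class), applied to just the tail of the encoded string:

```python
def modify_base64(encoded_str: str) -> str:
    # Insert 7 underscores just before the last 6 characters
    insert_index = len(encoded_str) - 6
    return encoded_str[:insert_index] + '_' * 7 + encoded_str[insert_index:]

print(modify_base64("ggELCPwGYAQI"))  # ggELCP_______wGYAQI
```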
Building the Full URL
Lastly, let’s generate the complete Google Flights URL by prepending “https://www.google.com/travel/flights/search?tfs=”:
```python
@classmethod
def build_url(cls, departure: str, destination: str, departure_date: str) -> str:
    flight_bytes = cls._create_one_way_bytes(departure, destination, departure_date)
    base64_str = base64.b64encode(flight_bytes).decode('utf-8')
    modified_str = cls._modify_base64(base64_str)
    return f'https://www.google.com/travel/flights/search?tfs={modified_str}'
```
Scraping the flight data
We will extract each element using a CSS selector and, where applicable, an aria-label.
```python
async def extract_flight_element_text(flight, selector: str, aria_label: Optional[str] = None) -> str:
    if aria_label:
        element = await flight.query_selector(f'{selector}[aria-label*="{aria_label}"]')
    else:
        element = await flight.query_selector(selector)
    return await element.inner_text() if element else "N/A"
```
The extract_flight_element_text function is an asynchronous utility designed to extract text from elements on a web page. Here’s how it works:
Parameters:
- flight: The web element to search within.
- selector: A string representing the CSS selector to locate the element.
- aria_label (optional): An accessibility label to refine the search within the selected elements.
Logic:
- If an aria_label is provided, the function adds a condition to the selector to search for elements containing the specified label.
- It queries the element using the combined selector.
Return Value:
- If the element is found, the function returns its inner text.
- If no element matches the query, it returns “N/A” as a fallback.
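The selector composition is plain string formatting; for example, with the span element and the aria-label used below for departure times:

```python
selector = 'span'
aria_label = 'Departure time'
combined = f'{selector}[aria-label*="{aria_label}"]'
print(combined)  # span[aria-label*="Departure time"]
```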
Departure time

```python
departure_time = await extract_flight_element_text(flight, 'span', "Departure time")
```

Arrival time

```python
arrival_time = await extract_flight_element_text(flight, 'span', "Arrival time")
```

Airline

```python
airline = await extract_flight_element_text(flight, ".sSHqwe")
```

Flight duration

```python
duration = await extract_flight_element_text(flight, "div.gvkrdb")
```

Stops

```python
stops = await extract_flight_element_text(flight, "div.EfT7Ae span.ogfYpf")
```

Price

```python
price = await extract_flight_element_text(flight, "div.FpEdX span")
```

CO2 emissions

```python
co2_emissions = await extract_flight_element_text(flight, "div.O7CXue")
```

Emissions variation

```python
emissions_variation = await extract_flight_element_text(flight, "div.N6PNV")
```
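These class names (sSHqwe, gvkrdb, and so on) are obfuscated and tend to change over time, so it can help to collect them in one place. This mapping is a refactoring suggestion of mine, not part of the original code:

```python
# All Google Flights selectors used above, gathered for easy maintenance
FLIGHT_SELECTORS = {
    "airline": ".sSHqwe",
    "duration": "div.gvkrdb",
    "stops": "div.EfT7Ae span.ogfYpf",
    "price": "div.FpEdX span",
    "co2_emissions": "div.O7CXue",
    "emissions_variation": "div.N6PNV",
}

for field, selector in FLIGHT_SELECTORS.items():
    print(f"{field}: {selector}")
```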
Saving to CSV
Before saving the information in CSV format, we need to make sure the data is clean of any unwanted characters.
```python
def clean_csv(filename: str):
    data = pd.read_csv(filename, encoding="utf-8")

    def clean_text(value):
        if isinstance(value, str):
            # Strip mojibake characters and replace non-breaking spaces
            return value.replace('Â', '').replace('\xa0', ' ').replace('Ã', '').replace('¶', '').strip()
        return value

    cleaned_data = data.applymap(clean_text)
    cleaned_file_path = f"(unknown)"
    cleaned_data.to_csv(cleaned_file_path, index=False)
    print(f"Cleaned CSV saved to: {cleaned_file_path}")
```
```python
def save_to_csv(data: List[Dict[str, str]], filename: str = "flight_data.csv") -> None:
    if not data:
        return
    headers = list(data[0].keys())
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=headers)
        writer.writeheader()
        writer.writerows(data)
    # Clean the saved CSV
    clean_csv(filename)
```
Here’s the result for the flight from San Francisco (SFO) to Los Angeles (LAX) on 25 December 2024.
The complete code
```python
import asyncio
import csv
import base64
from playwright.async_api import async_playwright
from typing import List, Dict, Optional
import pandas as pd


class FlightURLBuilder:
    """Class to handle flight URL creation with base64 encoding."""

    @staticmethod
    def _create_one_way_bytes(departure: str, destination: str, date: str) -> bytes:
        """Create bytes for one-way flight."""
        return (
            b'\x08\x1c\x10\x02\x1a\x1e\x12\n' + date.encode() +
            b'j\x07\x08\x01\x12\x03' + departure.encode() +
            b'r\x07\x08\x01\x12\x03' + destination.encode() +
            b'@\x01H\x01p\x01\x82\x01\x0b\x08\xfc\x06`\x04\x08'
        )

    @staticmethod
    def _modify_base64(encoded_str: str) -> str:
        """Add underscores at the specific position in the base64 string."""
        insert_index = len(encoded_str) - 6
        return encoded_str[:insert_index] + '_' * 7 + encoded_str[insert_index:]

    @classmethod
    def build_url(cls, departure: str, destination: str, departure_date: str) -> str:
        flight_bytes = cls._create_one_way_bytes(departure, destination, departure_date)
        base64_str = base64.b64encode(flight_bytes).decode('utf-8')
        modified_str = cls._modify_base64(base64_str)
        return f'https://www.google.com/travel/flights/search?tfs={modified_str}'


async def setup_browser():
    p = await async_playwright().start()
    browser = await p.chromium.launch(headless=False)
    page = await browser.new_page()
    return p, browser, page


async def extract_flight_element_text(flight, selector: str, aria_label: Optional[str] = None) -> str:
    """Extract text from a flight element using a selector and optional aria-label."""
    if aria_label:
        element = await flight.query_selector(f'{selector}[aria-label*="{aria_label}"]')
    else:
        element = await flight.query_selector(selector)
    return await element.inner_text() if element else "N/A"


async def scrape_flight_info(flight) -> Dict[str, str]:
    """Extract all relevant information from a single flight element."""
    departure_time = await extract_flight_element_text(flight, 'span', "Departure time")
    arrival_time = await extract_flight_element_text(flight, 'span', "Arrival time")
    airline = await extract_flight_element_text(flight, ".sSHqwe")
    duration = await extract_flight_element_text(flight, "div.gvkrdb")
    stops = await extract_flight_element_text(flight, "div.EfT7Ae span.ogfYpf")
    price = await extract_flight_element_text(flight, "div.FpEdX span")
    co2_emissions = await extract_flight_element_text(flight, "div.O7CXue")
    emissions_variation = await extract_flight_element_text(flight, "div.N6PNV")
    return {
        "Departure Time": departure_time,
        "Arrival Time": arrival_time,
        "Airline Company": airline,
        "Flight Duration": duration,
        "Stops": stops,
        "Price": price,
        "co2 emissions": co2_emissions,
        "emissions variation": emissions_variation,
    }


def clean_csv(filename: str):
    """Clean unwanted characters from the saved CSV file."""
    data = pd.read_csv(filename, encoding="utf-8")

    def clean_text(value):
        if isinstance(value, str):
            return value.replace('Â', '').replace('\xa0', ' ').replace('Ã', '').replace('¶', '').strip()
        return value

    cleaned_data = data.applymap(clean_text)
    cleaned_file_path = f"(unknown)"
    cleaned_data.to_csv(cleaned_file_path, index=False)
    print(f"Cleaned CSV saved to: {cleaned_file_path}")


def save_to_csv(data: List[Dict[str, str]], filename: str = "flight_data.csv") -> None:
    """Save flight data to a CSV file."""
    if not data:
        return
    headers = list(data[0].keys())
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=headers)
        writer.writeheader()
        writer.writerows(data)
    # Clean the saved CSV
    clean_csv(filename)


async def scrape_flight_data(one_way_url):
    flight_data = []
    playwright, browser, page = await setup_browser()
    try:
        await page.goto(one_way_url)
        # Wait for flight data to load
        await page.wait_for_selector(".pIav2d")
        # Get all flights and extract their information
        flights = await page.query_selector_all(".pIav2d")
        for flight in flights:
            flight_info = await scrape_flight_info(flight)
            flight_data.append(flight_info)
        # Save the extracted data in CSV format
        save_to_csv(flight_data)
    finally:
        await browser.close()
        await playwright.stop()


if __name__ == "__main__":
    one_way_url = FlightURLBuilder.build_url(
        departure="SFO",
        destination="LAX",
        departure_date="2024-12-25"
    )
    print("One-way URL:", one_way_url)
    # Run the scraper
    asyncio.run(scrape_flight_data(one_way_url))
```
Setting Up Proxy Rotation
For large-scale scraping, using proxies helps distribute your requests across multiple IPs, reducing the risk of being blocked.
I’m using the free residential proxy from Rayobyte.
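The setup below configures a single proxy; actual rotation can be sketched by cycling through a pool of servers on each browser launch. The pool addresses here are placeholders, and the helper is my own illustration:

```python
import itertools

# Placeholder pool; in practice these come from your proxy provider
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]
_rotation = itertools.cycle(PROXY_POOL)

def next_proxy() -> str:
    """Return the next proxy server, wrapping around when the pool is exhausted."""
    return next(_rotation)
```

Each call to setup_browser could then pass next_proxy() as the proxy server instead of a fixed value.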
Save the proxy credentials in a .env file:

```
# Proxy Configuration
PROXY_SERVER=http://proxy.example.com:8080
PROXY_USERNAME=your_username
PROXY_PASSWORD=your_password
PROXY_BYPASS=localhost,127.0.0.1
```
Set up the proxy configuration:

```python
import os

# Reading values from the .env file requires loading it into the environment
# first (for example with the python-dotenv package: load_dotenv()).

class ProxyConfig:
    def __init__(self):
        self.server = os.getenv('PROXY_SERVER')
        self.username = os.getenv('PROXY_USERNAME')
        self.password = os.getenv('PROXY_PASSWORD')
        self.bypass = os.getenv('PROXY_BYPASS')

    def get_proxy_settings(self) -> Optional[Dict]:
        if not self.server:
            return None
        proxy_settings = {"server": self.server}
        if self.username and self.password:
            proxy_settings.update({
                "username": self.username,
                "password": self.password
            })
        if self.bypass:
            proxy_settings["bypass"] = self.bypass
        return proxy_settings

    @property
    def is_configured(self) -> bool:
        return bool(self.server)
```
Set up the browser with the proxy:

```python
async def setup_browser():
    p = await async_playwright().start()
    browser_settings = {"headless": False}

    # Initialize proxy configuration from environment variables
    proxy_config = ProxyConfig()
    if proxy_config.is_configured:
        proxy_settings = proxy_config.get_proxy_settings()
        if proxy_settings:
            browser_settings["proxy"] = proxy_settings

    browser = await p.chromium.launch(**browser_settings)
    page = await browser.new_page()
    return p, browser, page
```
Conclusion
Building a flight price tracker using Python and Playwright allows you to automate the collection of valuable flight data for personal or business purposes. In this tutorial, you learned how to:
- Understand and decode Google Flights URLs
- Automate browser actions with Playwright
- Extract and clean flight data
- Save data in a structured CSV format
- Enhance the scraper with proxy rotation and delays
Disclaimer
Scraping Google Flights or any similar service should always adhere to ethical guidelines and respect the site’s terms of service. This tutorial is for educational purposes only. Use this tool responsibly and only for permitted purposes.
Check out this GitHub Repository for the complete source code.
Happy Scraping! 🚀