Create a Flight Price Tracker: Scraping Airlines Ticket Prices from Google Flights using Python

Source code: google_flight_scraper 

Table of Contents

Introduction
Ethical Considerations
Data that we want to scrape
Prerequisites
Project Setup
Why Playwright?
Setting up Browser Automation
Understanding Google Flights URL Structure
Scraping the flight data
Saving to CSV
The complete code
Setting Up Proxy Rotation
Conclusion
Disclaimer

Introduction

Google Flights aggregates data from various airlines and travel companies, providing travelers with comprehensive information about available flights, pricing, and schedules. This allows travelers to compare airline prices, assess flight durations, and monitor environmental impacts, ultimately helping them secure the best travel deals.

In this tutorial, I will guide you through the process of scraping essential flight data from Google Flights using Python and Playwright. You will learn how to extract valuable information such as departure and arrival times, flight durations, prices, and more—all while ensuring that your scraping methods are effective and efficient.

This information is not only valuable for individual travelers but also for businesses. Companies can leverage flight data to conduct competitor analysis, understand customer preferences, and make informed decisions about pricing and marketing strategies. By scraping data from Google Flights, businesses can gain insights into market trends and optimize their offerings to better meet the needs of their customers.

Ethical Considerations

Web scraping involves legal and ethical responsibilities:

  • Respect website terms of service
  • Avoid overwhelming server resources
  • Use scraping for research and personal purposes
  • Implement rate limiting and proxy rotation (a minimal delay sketch follows this list)
  • Ensure data is not used for commercial exploitation without permission
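
One lightweight way to implement the rate-limiting point above is a randomized delay between page visits. Here is a minimal sketch (the polite_pause helper and its bounds are illustrative, not part of the original scraper):

import asyncio
import random

async def polite_pause(min_s: float = 2.0, max_s: float = 5.0) -> None:
    # Sleep for a random interval so requests don't arrive at a fixed, bot-like cadence
    await asyncio.sleep(random.uniform(min_s, max_s))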

Data that we want to scrape

We will collect the following flight information:

  • Departure times
  • Arrival times
  • Airline name
  • Flight duration
  • Number of stops
  • Price
  • CO2 emissions
  • Emissions comparison with typical flights

Prerequisites

  • Python 3.7+
  • Basic Python knowledge
  • Required package: playwright (asyncio ships with Python’s standard library, so it does not need to be installed)
pip install playwright
playwright install  # Install browser binaries

Why Playwright?

Playwright is a modern automation framework that makes browser automation straightforward. It supports multiple browsers and offers robust features for handling dynamic websites, including waiting for elements to load and intercepting network requests.

Its asynchronous capabilities allow for efficient handling of multiple tasks, making it suitable for web scraping where speed and performance are crucial.
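
As a small illustration of that async model (a standalone sketch, separate from the scraper we build below), several pages can be fetched concurrently with asyncio.gather:

import asyncio
from playwright.async_api import async_playwright

async def fetch_title(browser, url: str) -> str:
    # Each fetch gets its own page so the tasks can run concurrently
    page = await browser.new_page()
    try:
        await page.goto(url)
        return await page.title()
    finally:
        await page.close()

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        titles = await asyncio.gather(
            fetch_title(browser, "https://example.com"),
            fetch_title(browser, "https://example.org"),
        )
        print(titles)
        await browser.close()

asyncio.run(main())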

Setting up Browser Automation

from playwright.async_api import async_playwright

async def setup_browser():
    p = await async_playwright().start()
    browser = await p.chromium.launch(headless=False)  # Set to True in production
    page = await browser.new_page()
    return p, browser, page

This function initializes the Playwright browser, allowing for web scraping of flight data. The headless parameter can be toggled for visibility during development.
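
A quick way to smoke-test the function (the try/finally teardown mirrors what the complete code does later):

import asyncio

async def main():
    playwright, browser, page = await setup_browser()
    try:
        await page.goto("https://www.google.com/travel/flights")
        print(await page.title())
    finally:
        await browser.close()
        await playwright.stop()

asyncio.run(main())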

Understanding Google Flights URL Structure

One of the trickiest parts of scraping Google Flights is constructing the correct URLs. Google Flights packs the search parameters into a compact binary payload (it appears to be a serialized protocol buffer) and base64-encodes it into the tfs query parameter.

[Screenshot: Google Flights search form and the resulting browser URL]

Once we click the “Search” button, we will notice a string like this in the URL: “CBwQAhoeEgoyMDI0LTEyLTI1agcIARIDU0ZPcgcIARIDTEFYQAFIAXABggELCP___________wGYAQI”

Let’s decode this string with Python’s base64 module:

import base64

encoded = "CBwQAhoeEgoyMDI0LTEyLTI1agcIARIDU0ZPcgcIARIDTEFYQAFIAXABggELCP___________wGYAQI"
# Pad to a multiple of four characters before decoding
decoded = base64.urlsafe_b64decode(encoded + "=" * (-len(encoded) % 4))
print(decoded)

[Output: raw bytes containing “2024-12-25”, “SFO”, and “LAX”]

From the result, we can confirm that the flight is on 25 December 2024, departing from San Francisco (SFO) to Los Angeles (LAX).

To incorporate this into our scraper, we need to reverse the process: build the same binary payload from our flight details and base64-encode it into a URL.

Let’s handle this by creating a FlightURLBuilder class.

class FlightURLBuilder:

Creating Binary Data

@staticmethod
def _create_one_way_bytes(departure: str, destination: str, date: str) -> bytes:
    return (
        b'\x08\x1c\x10\x02\x1a\x1e\x12\n' + date.encode() +
        b'j\x07\x08\x01\x12\x03' + departure.encode() +
        b'r\x07\x08\x01\x12\x03' + destination.encode() +
        b'@\x01H\x01p\x01\x82\x01\x0b\x08\xfc\x06`\x04\x08'
    )

This code generates a bytes object that encodes the flight details (departure, destination, and date).
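
As a quick sanity check (an illustrative snippet, not part of the original code), the date and airport codes we saw in the decoded URL should appear verbatim in the generated payload:

# Departure, destination, and date from the decoded example above
payload = FlightURLBuilder._create_one_way_bytes("SFO", "LAX", "2024-12-25")
assert b"2024-12-25" in payload
assert b"SFO" in payload and b"LAX" in payload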

Modifying a Base64 String

Encoding these bytes with standard base64 produces:

“CBwQAhoeEgoyMDI0LTEyLTI1agcIARIDU0ZPcgcIARIDTEFYQAFIAXABggELCPwGYAQI”

Google’s URL, however, contains a run of underscores near the end, so we insert seven of them before the last six characters:

@staticmethod
def _modify_base64(encoded_str: str) -> str:
    insert_index = len(encoded_str) - 6
    return encoded_str[:insert_index] + '_' * 7 + encoded_str[insert_index:]
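
For example, applied to the encoded string above:

encoded = "CBwQAhoeEgoyMDI0LTEyLTI1agcIARIDU0ZPcgcIARIDTEFYQAFIAXABggELCPwGYAQI"
print(FlightURLBuilder._modify_base64(encoded))
# ...ggELCP_______wGYAQI  (seven underscores before the last six characters)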

Building the Full URL

Finally, let’s generate the complete Google Flights URL by prefixing the encoded string with “https://www.google.com/travel/flights/search?tfs=”:

@classmethod
def build_url(
    cls,
    departure: str,
    destination: str,
    departure_date: str
) -> str:
    flight_bytes = cls._create_one_way_bytes(departure, destination, departure_date)
    base64_str = base64.b64encode(flight_bytes).decode('utf-8')
    modified_str = cls._modify_base64(base64_str)
    return f'https://www.google.com/travel/flights/search?tfs={modified_str}'
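
With all three pieces in place, building a one-way search URL is a single call (this mirrors the __main__ block in the complete code below):

one_way_url = FlightURLBuilder.build_url(
    departure="SFO",
    destination="LAX",
    departure_date="2024-12-25"
)
print(one_way_url)
# https://www.google.com/travel/flights/search?tfs=CBwQAhoe...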

Scraping the flight data

We will extract each element using a CSS selector and, where applicable, an aria-label:

async def extract_flight_element_text(flight, selector: str, aria_label: Optional[str] = None) -> str:
    if aria_label:
        element = await flight.query_selector(f'{selector}[aria-label*="{aria_label}"]')
    else:
        element = await flight.query_selector(selector)
    return await element.inner_text() if element else "N/A"

The extract_flight_element_text function is an asynchronous utility designed to extract text from elements on a web page. Here’s how it works:

Parameters:

  • flight: The web element to search within.
  • selector: A string representing the CSS selector to locate the element.
  • aria_label (optional): An accessibility label to refine the search within the selected elements.

Logic:

  • If an aria_label is provided, the function adds a condition to the selector to search for elements containing the specified label.
  • It queries the element using the combined selector.

Return Value:

  • If the element is found, the function returns its inner text.
  • If no element matches the query, it returns “N/A” as a fallback.
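
For example, with selector 'span' and aria_label "Departure time", the combined selector the function queries looks like this:

selector = 'span'
aria_label = "Departure time"
print(f'{selector}[aria-label*="{aria_label}"]')  # span[aria-label*="Departure time"]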

Departure Time

departure_time = await extract_flight_element_text(flight, 'span', "Departure time")

Arrival Time

arrival_time = await extract_flight_element_text(flight, 'span', "Arrival time")

Airline

airline = await extract_flight_element_text(flight, ".sSHqwe")

Flight Duration

duration = await extract_flight_element_text(flight, "div.gvkrdb")

Stops

stops = await extract_flight_element_text(flight, "div.EfT7Ae span.ogfYpf")

Price

price = await extract_flight_element_text(flight, "div.FpEdX span")

CO2 Emissions

co2_emissions = await extract_flight_element_text(flight, "div.O7CXue")

Emissions Variation

emissions_variation = await extract_flight_element_text(flight, "div.N6PNV")

Saving to CSV

Before saving the information in CSV format, we need to make sure the data is free of unwanted characters (mojibake such as “Â” that appears when UTF-8 text is mis-decoded):

def clean_csv(filename: str):
    data = pd.read_csv(filename, encoding="utf-8")

    def clean_text(value):
        if isinstance(value, str):
            # Strip mojibake artifacts and normalize non-breaking spaces
            return value.replace('Â', '').replace('\xa0', ' ').replace('Ã', '').replace('¶', '').strip()
        return value

    # DataFrame.applymap was renamed to DataFrame.map in pandas 2.1
    cleaned_data = data.applymap(clean_text)
    cleaned_data.to_csv(filename, index=False)
    print(f"Cleaned CSV saved to: {filename}")

def save_to_csv(data: List[Dict[str, str]], filename: str = "flight_data.csv") -> None:
    if not data:
        return    
    headers = list(data[0].keys())    
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=headers)
        writer.writeheader()
        writer.writerows(data)
    
    # Clean the saved CSV
    clean_csv(filename)

Here’s the result for a flight departing from San Francisco (SFO) to Los Angeles (LAX) on 25 December 2024:

[Screenshot: the resulting flight_data.csv opened as a table]

The complete code

import asyncio
import csv
import base64
from playwright.async_api import async_playwright
from typing import List, Dict, Optional
import pandas as pd


class FlightURLBuilder:
    """Class to handle flight URL creation with base64 encoding."""
    
    @staticmethod
    def _create_one_way_bytes(departure: str, destination: str, date: str) -> bytes:
        """Create bytes for one-way flight."""
        return (
            b'\x08\x1c\x10\x02\x1a\x1e\x12\n' + date.encode() +
            b'j\x07\x08\x01\x12\x03' + departure.encode() +
            b'r\x07\x08\x01\x12\x03' + destination.encode() +
            b'@\x01H\x01p\x01\x82\x01\x0b\x08\xfc\x06`\x04\x08'
        )
    
    @staticmethod
    def _modify_base64(encoded_str: str) -> str:
        """Add underscores at the specific position in base64 string."""
        insert_index = len(encoded_str) - 6
        return encoded_str[:insert_index] + '_' * 7 + encoded_str[insert_index:]

    @classmethod
    def build_url(
        cls,
        departure: str,
        destination: str,
        departure_date: str
    ) -> str:
        
        flight_bytes = cls._create_one_way_bytes(departure, destination, departure_date)
        base64_str = base64.b64encode(flight_bytes).decode('utf-8')
        modified_str = cls._modify_base64(base64_str)
        return f'https://www.google.com/travel/flights/search?tfs={modified_str}'


async def setup_browser():
    p = await async_playwright().start()
    browser = await p.chromium.launch(headless=False)
    page = await browser.new_page()
    return p, browser, page


async def extract_flight_element_text(flight, selector: str, aria_label: Optional[str] = None) -> str:
    """Extract text from a flight element using selector and optional aria-label."""
    if aria_label:
        element = await flight.query_selector(f'{selector}[aria-label*="{aria_label}"]')
    else:
        element = await flight.query_selector(selector)
    return await element.inner_text() if element else "N/A"


async def scrape_flight_info(flight) -> Dict[str, str]:
    """Extract all relevant information from a single flight element."""
    departure_time = await extract_flight_element_text(flight, 'span', "Departure time")
    arrival_time = await extract_flight_element_text(flight, 'span', "Arrival time")
    airline = await extract_flight_element_text(flight, ".sSHqwe")
    duration = await extract_flight_element_text(flight, "div.gvkrdb")
    stops = await extract_flight_element_text(flight, "div.EfT7Ae span.ogfYpf")
    price = await extract_flight_element_text(flight, "div.FpEdX span")
    co2_emissions = await extract_flight_element_text(flight, "div.O7CXue")
    emissions_variation = await extract_flight_element_text(flight, "div.N6PNV")
    return {
        "Departure Time": departure_time,
        "Arrival Time": arrival_time,
        "Airline Company": airline,
        "Flight Duration": duration,
        "Stops": stops,
        "Price": price,
        "co2 emissions": co2_emissions,
        "emissions variation": emissions_variation
    }

def clean_csv(filename: str):
    """Clean unwanted characters from the saved CSV file."""
    data = pd.read_csv(filename, encoding="utf-8")
    
    def clean_text(value):
        if isinstance(value, str):
            # Strip mojibake artifacts and normalize non-breaking spaces
            return value.replace('Â', '').replace('\xa0', ' ').replace('Ã', '').replace('¶', '').strip()
        return value

    # DataFrame.applymap was renamed to DataFrame.map in pandas 2.1
    cleaned_data = data.applymap(clean_text)
    cleaned_data.to_csv(filename, index=False)
    print(f"Cleaned CSV saved to: {filename}")

def save_to_csv(data: List[Dict[str, str]], filename: str = "flight_data.csv") -> None:
    """Save flight data to a CSV file."""
    if not data:
        return
    
    headers = list(data[0].keys())
    
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=headers)
        writer.writeheader()
        writer.writerows(data)
    
    # Clean the saved CSV
    clean_csv(filename)

async def scrape_flight_data(one_way_url):
    flight_data = []

    playwright, browser, page = await setup_browser()
    
    try:
        await page.goto(one_way_url)
        
        # Wait for flight data to load
        await page.wait_for_selector(".pIav2d")
        
        # Get all flights and extract their information
        flights = await page.query_selector_all(".pIav2d")
        for flight in flights:
            flight_info = await scrape_flight_info(flight)
            flight_data.append(flight_info)
        
        # Save the extracted data in CSV format
        save_to_csv(flight_data)
            
    finally:
        await browser.close()
        await playwright.stop()

if __name__ == "__main__":
    one_way_url = FlightURLBuilder.build_url(
        departure="SFO",
        destination="LAX",
        departure_date="2024-12-25"
    )
    print("One-way URL:", one_way_url)

    # Run the scraper
    asyncio.run(scrape_flight_data(one_way_url))

Setting Up Proxy Rotation

For large-scale scraping, proxies help distribute your requests across multiple IPs, reducing the risk of being blocked.

I’m using the free residential proxy from Rayobyte.

Save the proxy credentials in a .env file:

# Proxy Configuration
PROXY_SERVER=http://proxy.example.com:8080
PROXY_USERNAME=your_username
PROXY_PASSWORD=your_password
PROXY_BYPASS=localhost,127.0.0.1

Set up the proxy configuration (loading the .env file in this sketch assumes the python-dotenv package):

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # Read the proxy credentials from the .env file


class ProxyConfig:
    def __init__(self):
        self.server = os.getenv('PROXY_SERVER')
        self.username = os.getenv('PROXY_USERNAME')
        self.password = os.getenv('PROXY_PASSWORD')
        self.bypass = os.getenv('PROXY_BYPASS')

    def get_proxy_settings(self) -> Optional[Dict]:
        if not self.server:
            return None

        proxy_settings = {
            "server": self.server
        }

        if self.username and self.password:
            proxy_settings.update({
                "username": self.username,
                "password": self.password
            })

        if self.bypass:
            proxy_settings["bypass"] = self.bypass

        return proxy_settings

    @property
    def is_configured(self) -> bool:
        return bool(self.server)

Set up the browser with the proxy:

async def setup_browser():
    p = await async_playwright().start()  
    browser_settings = {
        "headless": False
    }
    
    # Initialize proxy configuration from environment variables
    proxy_config = ProxyConfig()
    if proxy_config.is_configured:
        proxy_settings = proxy_config.get_proxy_settings()
        if proxy_settings:
            browser_settings["proxy"] = proxy_settings
   
    browser = await p.chromium.launch(**browser_settings)
    page = await browser.new_page()

    return p, browser, page
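
The code above wires in a single proxy per launch. To actually rotate, pick a different server each time the browser starts. Here is a minimal sketch, assuming a hypothetical comma-separated PROXY_SERVERS entry in the .env file (credentials shared across all servers):

import os
import random
from typing import Dict, Optional

def get_rotating_proxy_settings() -> Optional[Dict]:
    # PROXY_SERVERS is a hypothetical variable, e.g. "http://p1:8080,http://p2:8080"
    servers = [s.strip() for s in os.getenv('PROXY_SERVERS', '').split(',') if s.strip()]
    if not servers:
        return None

    proxy_settings = {"server": random.choice(servers)}

    username = os.getenv('PROXY_USERNAME')
    password = os.getenv('PROXY_PASSWORD')
    if username and password:
        proxy_settings.update({"username": username, "password": password})

    return proxy_settings

Launching a fresh browser (and therefore a fresh proxy) per search then spreads requests across IPs.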

Conclusion

Building a flight price tracker using Python and Playwright allows you to automate the collection of valuable flight data for personal or business purposes. In this tutorial, you learned how to:

  • Understand and decode Google Flights URLs
  • Automate browser actions with Playwright
  • Extract and clean flight data
  • Save data in a structured CSV format
  • Enhance the scraper with proxy rotation and delays

Disclaimer

Scraping Google Flights or any similar service should always adhere to ethical guidelines and respect the site’s terms of service. This tutorial is for educational purposes only. Use this tool responsibly and only for permitted purposes.

Check out this GitHub Repository for the complete source code.

Happy Scraping! 🚀
