How to Parse Google Search Results: A Complete Guide with AI-Powered Extraction
Watch the full tutorial on YouTube
Why Use the Gemini 2.5 Pro AI Model?
- Large Input Capacity: Gemini 2.5 Pro can process extensive HTML content, making it ideal for analyzing Google’s SERPs, which often include diverse elements like ads, knowledge panels, and local business listings.
- Advanced Contextual Understanding: Its state-of-the-art natural language processing (NLP) capabilities allow it to interpret HTML structures and categorize result types (e.g., organic results, videos, recipes) with precision.
- Structured JSON Output: Gemini 2.5 Pro can generate clean, machine-readable JSON based on detailed prompts, ensuring the extracted data is ready for analysis or integration into other tools.
- Robust Handling of Variability: It adapts to inconsistencies in HTML structure, reducing the need for manual preprocessing or fragile parsing rules.
- Efficiency and Scalability: The model’s API is optimized for processing large datasets, making it suitable for projects requiring bulk SERP analysis.
The Code: Parsing Google Search Results with Gemini
import json
import os
from datetime import datetime

from google import genai

# Configure Gemini API (replace the placeholder with your own key,
# or load it from an environment variable as discussed below)
client = genai.Client(api_key="YOUR_API_KEY_HERE")


def read_html_file():
    """Read HTML file from user input"""
    filename = input("Enter the HTML file name (with extension, e.g., 'search_results.html'): ").strip()
    try:
        with open(filename, 'r', encoding='utf-8') as file:
            contents = file.read()
        print(f"✓ File '{filename}' read successfully!")
        print(f"File size: {len(contents)} characters")
        return contents, filename
    except FileNotFoundError:
        print(f"❌ Error: The file '{filename}' does not exist.")
        return None, None
    except Exception as e:
        print(f"❌ An error occurred: {e}")
        return None, None


def parse_search_results(html_content):
    """Parse Google search results using Gemini and return structured JSON"""
    prompt = f"""
You are an expert Google search results parser. Parse the HTML content and return ONLY a valid JSON object.

IMPORTANT: Your response must start with {{ and end with }} - NO other text, explanation, or formatting.

Extract search results from the HTML into this exact JSON structure:

RESULT TYPES TO IDENTIFY AND EXTRACT:
1. **Organic Results** - Regular web search results
2. **Ads/Sponsored** - Paid advertisements
3. **Featured Snippets** - Answer boxes at top
4. **Knowledge Panels** - Info boxes (usually right side)
5. **Local Business Results** - Maps/local listings with ratings, hours, phone
6. **YouTube Videos** - Video results from YouTube
7. **Images** - Image search results
8. **News** - News articles with date/source
9. **Shopping** - Product listings with prices
10. **Recipes** - Recipe cards with ratings, cook time
11. **People Also Ask** - Expandable question boxes
12. **Related Searches** - Suggested searches at bottom
13. **Twitter/Social** - Social media posts
14. **Academic** - Scholar/research papers
15. **Apps** - Mobile app results
16. **Events** - Event listings with dates
17. **Jobs** - Job postings
18. **Flights** - Flight information
19. **Hotels** - Hotel listings
20. **Weather** - Weather information
21. **Forum Results** - Discussion threads or posts from forums (e.g., Reddit, Quora, Stack Overflow)

Return ONLY valid JSON in this exact structure (no markdown, no explanation):
{{
  "search_metadata": {{
    "query": "extracted search query",
    "total_results": "estimated number if shown",
    "search_time": "time taken if shown",
    "page_title": "page title"
  }},
  "featured_snippets": [
    {{
      "rank": 1,
      "title": "snippet title",
      "content": "snippet content",
      "url": "source URL",
      "domain": "domain.com",
      "type": "featured_snippet"
    }}
  ],
  "knowledge_panel": {{
    "title": "panel title",
    "description": "description",
    "facts": {{}},
    "images": [],
    "source": "source",
    "type": "knowledge_panel"
  }},
  "organic_results": [
    {{
      "rank": 1,
      "title": "page title",
      "url": "full URL",
      "snippet": "description text",
      "domain": "domain.com",
      "breadcrumbs": "breadcrumb path if available",
      "date": "published date if available",
      "type": "organic"
    }}
  ],
  "local_business": [
    {{
      "rank": 1,
      "name": "business name",
      "rating": "4.5",
      "reviews_count": "123",
      "address": "full address",
      "phone": "phone number",
      "hours": "operating hours",
      "website": "website URL",
      "type": "local_business"
    }}
  ],
  "videos": [
    {{
      "rank": 1,
      "title": "video title",
      "url": "video URL",
      "thumbnail": "thumbnail URL",
      "duration": "video length",
      "views": "view count",
      "channel": "channel name",
      "upload_date": "upload date",
      "platform": "YouTube/Vimeo/etc",
      "type": "video"
    }}
  ],
  "images": [
    {{
      "rank": 1,
      "title": "image title",
      "url": "image URL",
      "source_url": "source page URL",
      "source_domain": "domain.com",
      "dimensions": "width x height",
      "type": "image"
    }}
  ],
  "news": [
    {{
      "rank": 1,
      "title": "news title",
      "url": "article URL",
      "snippet": "article snippet",
      "source": "news source",
      "date": "publish date",
      "thumbnail": "thumbnail URL if available",
      "type": "news"
    }}
  ],
  "shopping": [
    {{
      "rank": 1,
      "title": "product name",
      "price": "price",
      "rating": "rating if available",
      "reviews": "review count",
      "store": "store name",
      "url": "product URL",
      "image": "product image URL",
      "type": "shopping"
    }}
  ],
  "recipes": [
    {{
      "rank": 1,
      "title": "recipe name",
      "rating": "rating",
      "cook_time": "cooking time",
      "ingredients_count": "number of ingredients",
      "url": "recipe URL",
      "image": "recipe image",
      "source": "recipe source",
      "type": "recipe"
    }}
  ],
  "people_also_ask": [
    {{
      "question": "question text",
      "answer": "answer text if expanded",
      "source_url": "source URL",
      "type": "people_also_ask"
    }}
  ],
  "ads": [
    {{
      "rank": 1,
      "title": "ad title",
      "url": "ad URL",
      "snippet": "ad description",
      "domain": "advertiser domain",
      "position": "top/bottom/side",
      "type": "advertisement"
    }}
  ],
  "related_searches": [
    {{
      "query": "related search query",
      "type": "related_search"
    }}
  ],
  "social_media": [
    {{
      "rank": 1,
      "title": "post title/content",
      "url": "post URL",
      "platform": "Twitter/Facebook/etc",
      "author": "author name",
      "date": "post date",
      "type": "social_media"
    }}
  ],
  "apps": [
    {{
      "rank": 1,
      "name": "app name",
      "rating": "app rating",
      "downloads": "download count",
      "price": "app price",
      "platform": "iOS/Android",
      "url": "app store URL",
      "type": "app"
    }}
  ],
  "events": [
    {{
      "rank": 1,
      "title": "event name",
      "date": "event date",
      "location": "event location",
      "url": "event URL",
      "type": "event"
    }}
  ],
  "jobs": [
    {{
      "rank": 1,
      "title": "job title",
      "company": "company name",
      "location": "job location",
      "salary": "salary if available",
      "url": "job URL",
      "type": "job"
    }}
  ],
  "weather": {{
    "location": "weather location",
    "current_temp": "current temperature",
    "condition": "weather condition",
    "forecast": [],
    "type": "weather"
  }},
  "flights": [
    {{
      "airline": "airline name",
      "price": "flight price",
      "duration": "flight duration",
      "departure": "departure info",
      "arrival": "arrival info",
      "type": "flight"
    }}
  ],
  "forum_results": [
    {{
      "rank": 1,
      "title": "forum thread title",
      "url": "thread URL",
      "snippet": "thread preview text",
      "domain": "forum domain (e.g., reddit.com)",
      "author": "post author if available",
      "date": "post date if available",
      "replies_count": "number of replies if available",
      "type": "forum"
    }}
  ]
}}

CRITICAL PARSING RULES:
1. Return ONLY the JSON object, no other text
2. If a section has no results, use empty array [] or empty object {{}}
3. Always include "rank" starting from 1 for ordered results
4. Extract domain from URLs (e.g., "apple.com" from "https://www.apple.com/page")
5. Clean all text (remove extra spaces, newlines, HTML entities)
6. For ratings, extract numbers (e.g., "4.5" from "4.5 stars" or "★★★★☆")
7. For dates, extract in readable format
8. For prices, include currency symbol if available
9. Always identify the correct result type based on content and context
10. Maintain ranking order as they appear on the page (top to bottom)

HTML Content to analyze:
{html_content}
"""

    try:
        print("🔄 Parsing HTML content with Gemini AI...")

        # Use the correct client API format
        response = client.models.generate_content(
            model="gemini-2.5-pro-preview-06-05",
            contents=prompt
        )

        # Get the response text
        raw_response = response.text.strip()
        print("✓ Gemini API response received!")
        print(f"Response length: {len(raw_response)} characters")

        # Clean the response - remove markdown formatting if present
        cleaned_response = raw_response
        if cleaned_response.startswith('```json'):
            cleaned_response = cleaned_response[7:]
        if cleaned_response.endswith('```'):
            cleaned_response = cleaned_response[:-3]
        cleaned_response = cleaned_response.strip()

        # Show first 200 chars of response for debugging
        print(f"First 200 chars: {cleaned_response[:200]}...")

        # Try to parse as JSON
        try:
            parsed_data = json.loads(cleaned_response)
            print("✓ JSON parsing successful!")
            return parsed_data, raw_response
        except json.JSONDecodeError as e:
            print(f"❌ JSON parsing failed: {e}")
            print("Attempting to fix common JSON issues...")

            # Try to extract JSON from response if it's embedded in text
            json_start = cleaned_response.find('{')
            json_end = cleaned_response.rfind('}') + 1
            if json_start != -1 and json_end > json_start:
                json_part = cleaned_response[json_start:json_end]
                try:
                    parsed_data = json.loads(json_part)
                    print("✓ JSON parsing successful after extraction!")
                    return parsed_data, raw_response
                except json.JSONDecodeError:
                    print("❌ Still couldn't parse JSON after extraction")

            print("Raw response will be saved for manual inspection.")
            return None, raw_response

    except Exception as e:
        print(f"❌ Error calling Gemini API: {e}")
        return None, str(e)


def save_json_files(parsed_data, raw_response, original_filename):
    """Save both parsed JSON and raw response to files"""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    base_name = os.path.splitext(original_filename)[0]

    # Save raw response
    raw_filename = f"{base_name}_raw_{timestamp}.json"
    try:
        with open(raw_filename, 'w', encoding='utf-8') as f:
            if raw_response:
                # Try to format as JSON if possible, otherwise save as text
                try:
                    formatted_raw = json.dumps(json.loads(raw_response), indent=2, ensure_ascii=False)
                    f.write(formatted_raw)
                except json.JSONDecodeError:
                    f.write(raw_response)
            else:
                f.write("No raw response available")
        print(f"✓ Raw response saved to: {raw_filename}")
    except Exception as e:
        print(f"❌ Error saving raw file: {e}")

    # Save parsed JSON
    if parsed_data:
        parsed_filename = f"{base_name}_parsed_{timestamp}.json"
        try:
            with open(parsed_filename, 'w', encoding='utf-8') as f:
                json.dump(parsed_data, f, indent=2, ensure_ascii=False)
            print(f"✓ Parsed data saved to: {parsed_filename}")
            return parsed_filename, raw_filename
        except Exception as e:
            print(f"❌ Error saving parsed file: {e}")

    return None, raw_filename


def display_sample_results(parsed_data):
    """Display sample results from the parsed data"""
    if not parsed_data:
        print("❌ No parsed data to display")
        return

    print("\n" + "=" * 60)
    print("📊 SAMPLE RESULTS SUMMARY")
    print("=" * 60)

    # Search metadata
    if 'search_metadata' in parsed_data:
        metadata = parsed_data['search_metadata']
        print(f"\n🔍 Search Query: {metadata.get('query', 'N/A')}")
        print(f"📈 Total Results: {metadata.get('total_results', 'N/A')}")
        print(f"⏱️ Search Time: {metadata.get('search_time', 'N/A')}")

    # Count results by type
    result_counts = {}
    for key, value in parsed_data.items():
        if key != 'search_metadata' and isinstance(value, list):
            result_counts[key] = len(value)
        elif key != 'search_metadata' and isinstance(value, dict) and value:
            result_counts[key] = 1

    print(f"\n📋 RESULTS BREAKDOWN:")
    for result_type, count in result_counts.items():
        if count > 0:
            print(f"  • {result_type.replace('_', ' ').title()}: {count}")

    # Show sample organic results
    if 'organic_results' in parsed_data and parsed_data['organic_results']:
        print(f"\n🌐 SAMPLE ORGANIC RESULTS (Top 3):")
        for i, result in enumerate(parsed_data['organic_results'][:3], 1):
            print(f"\n  {i}. {result.get('title', 'No title')}")
            print(f"     URL: {result.get('url', 'No URL')}")
            print(f"     Domain: {result.get('domain', 'No domain')}")
            snippet = result.get('snippet', 'No snippet')
            if len(snippet) > 100:
                snippet = snippet[:100] + "..."
            print(f"     Snippet: {snippet}")

    # Show featured snippets
    if 'featured_snippets' in parsed_data and parsed_data['featured_snippets']:
        print(f"\n⭐ FEATURED SNIPPETS:")
        for snippet in parsed_data['featured_snippets']:
            print(f"  Title: {snippet.get('title', 'No title')}")
            content = snippet.get('content', 'No content')
            if len(content) > 150:
                content = content[:150] + "..."
            print(f"  Content: {content}")
            print(f"  Source: {snippet.get('domain', 'No domain')}")

    # Show local business results
    if 'local_business' in parsed_data and parsed_data['local_business']:
        print(f"\n🏪 LOCAL BUSINESS RESULTS:")
        for business in parsed_data['local_business'][:2]:
            print(f"  • {business.get('name', 'No name')}")
            print(f"    Rating: {business.get('rating', 'N/A')} ({business.get('reviews_count', 'N/A')} reviews)")
            print(f"    Address: {business.get('address', 'No address')}")

    # Show people also ask
    if 'people_also_ask' in parsed_data and parsed_data['people_also_ask']:
        print(f"\n❓ PEOPLE ALSO ASK (Top 3):")
        for qa in parsed_data['people_also_ask'][:3]:
            print(f"  • {qa.get('question', 'No question')}")

    print("\n" + "=" * 60)


def main():
    """Main function to orchestrate the parsing process"""
    print("🚀 Google Search Results Parser")
    print("=" * 40)

    # Read HTML file
    html_content, filename = read_html_file()
    if not html_content:
        return

    # Parse with Gemini
    parsed_data, raw_response = parse_search_results(html_content)

    # Save JSON files
    parsed_file, raw_file = save_json_files(parsed_data, raw_response, filename)

    # Display sample results
    if parsed_data:
        display_sample_results(parsed_data)
    else:
        print("❌ Could not parse results. Check the raw response file for details.")

    print(f"\n📁 FILES CREATED:")
    print(f"  • Raw response: {raw_file}")
    if parsed_file:
        print(f"  • Parsed JSON: {parsed_file}")

    print("\n✅ Process completed!")


if __name__ == "__main__":
    main()
Obtaining and Using the Gemini API Key
- What It Does: The code requires a Gemini API key to authenticate requests to the Gemini 2.5 Pro model.
- Code Snippet:
client = genai.Client(api_key="YOUR_API_KEY_HERE")
- Why It Matters: The API key is your gateway to accessing Gemini’s powerful parsing capabilities; without it, you can’t connect to the model. You can easily generate an API key by visiting Gemini AI Studio, signing in with your Google account, and following the instructions to create a key. Be sure to keep your API key secure and avoid hardcoding it in production code; use environment variables (e.g., os.environ["GEMINI_API_KEY"]) for safety, as sketched below.
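A minimal sketch of that environment-variable approach, assuming you have exported GEMINI_API_KEY in your shell before running the script:

import os

from google import genai

# Read the key from the environment instead of hardcoding it.
# Set it beforehand, e.g.: export GEMINI_API_KEY="your-key"
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])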
Crafting a High-Quality Prompt for Better Results
- What It Does: The prompt in the parse_search_results function is a detailed instruction set that tells Gemini 2.5 Pro how to parse the HTML and structure the output as JSON.
- Code Snippet:
prompt = f"""
You are an expert Google search results parser...
Return ONLY valid JSON in this exact structure...
{{
  "search_metadata": {{...}},
  "featured_snippets": [...],
  ...
}}
HTML Content to analyze:
{html_content}
"""
- Why It Matters: The prompt is the most critical part of the code because a better prompt yields better results. It specifies 21 result types (e.g., organic, ads, videos) and enforces rules like cleaning text and maintaining ranking order. Users can modify the prompt to extract additional fields (e.g., specific metadata) or adapt it for other HTML parsing tasks, making the tool highly customizable (see the sketch after this list).
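For example, here is a hedged sketch of one way to splice an extra rule into the prompt string inside parse_search_results before the API call. The "sitelinks" field is hypothetical and not part of the original schema:

# Hypothetical customization: inject an extra extraction rule into the prompt.
# The "sitelinks" field is illustrative only, not part of the original schema.
custom_rule = 'Also extract a "sitelinks" array for each organic result when sitelinks are present.'
prompt = prompt.replace(
    "HTML Content to analyze:",
    custom_rule + "\n\nHTML Content to analyze:",
)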
Requiring Pre-Scraped HTML Input
- What It Does: The script expects an HTML file containing Google search results, which it reads and passes to Gemini for parsing.
- Code Snippet:
def read_html_file():
    """Read HTML file from user input"""
    filename = input("Enter the HTML file name (with extension, e.g., 'search_results.html'): ").strip()
    try:
        with open(filename, 'r', encoding='utf-8') as file:
            contents = file.read()
        print(f"✓ File '{filename}' read successfully!")
        print(f"File size: {len(contents)} characters")
        return contents, filename
    except FileNotFoundError:
        print(f"❌ Error: The file '{filename}' does not exist.")
        return None, None
    except Exception as e:
        print(f"❌ An error occurred: {e}")
        return None, None
- Why It Matters: This tutorial focuses exclusively on parsing HTML with Gemini 2.5 Pro, not on scraping Google search results. Users must provide pre-scraped HTML, which they can obtain with web scraping tools such as Python’s requests, BeautifulSoup, or Selenium (a minimal fetch sketch follows this list). For detailed scraping instructions, refer to my tutorial: How to Scrape Google Search Results with Python.
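As a quick illustration only (not a robust scraper), a minimal requests-based fetch might look like the sketch below. Google frequently blocks or captchas plain HTTP clients, so treat this as a starting point and see the linked tutorial for a production approach; the query URL and User-Agent are placeholders:

import requests

# Illustrative only: fetch one results page and save it for the parser.
# Google may block or captcha plain requests; URL and headers are placeholders.
url = "https://www.google.com/search?q=best+pizza+in+new+york"
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()

with open("search_results.html", "w", encoding="utf-8") as f:
    f.write(response.text)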
Generating Raw and Parsed JSON Outputs
- What It Does: The script produces two JSON files: <base>_parsed_<timestamp>.json (the structured output) and <base>_raw_<timestamp>.json (the raw AI response for debugging).
- Code Snippet:
# Save raw response
raw_filename = f"{base_name}_raw_{timestamp}.json"
try:
    with open(raw_filename, 'w', encoding='utf-8') as f:
        if raw_response:
            # Try to format as JSON if possible, otherwise save as text
            try:
                formatted_raw = json.dumps(json.loads(raw_response), indent=2, ensure_ascii=False)
                f.write(formatted_raw)
            except json.JSONDecodeError:
                f.write(raw_response)
        else:
            f.write("No raw response available")
    print(f"✓ Raw response saved to: {raw_filename}")
except Exception as e:
    print(f"❌ Error saving raw file: {e}")

# Save parsed JSON
if parsed_data:
    parsed_filename = f"{base_name}_parsed_{timestamp}.json"
    try:
        with open(parsed_filename, 'w', encoding='utf-8') as f:
            json.dump(parsed_data, f, indent=2, ensure_ascii=False)
        print(f"✓ Parsed data saved to: {parsed_filename}")
        return parsed_filename, raw_filename
    except Exception as e:
        print(f"❌ Error saving parsed file: {e}")

return None, raw_filename
- Why It Matters: The parsed JSON file contains the structured data (e.g., organic results, featured snippets) in the format specified by the prompt, ready for analysis or integration (see the loading sketch below). The raw JSON file captures Gemini’s unprocessed response, including any errors, which is invaluable for debugging if the parsing fails or produces unexpected results. This dual-output approach ensures transparency and reliability.
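Once the parsed file exists, downstream use is a plain json.load. A minimal sketch, where the filename is illustrative since the script timestamps each output:

import json

# Filename is illustrative; the script timestamps each output file.
with open("search_results_parsed_20250101_120000.json", encoding="utf-8") as f:
    data = json.load(f)

# Iterate over one section of the structured output.
for result in data.get("organic_results", []):
    print(result.get("rank"), result.get("title"), result.get("url"))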
Here’s a screenshot of what the JSON result looks like:
Here’s a recap of everything from start to finish:
Prerequisites
- Install the google-genai package (the script imports it via from google import genai): pip install google-genai
- Get your Gemini API key from Gemini AI Studio (a quick sanity check follows this list).
- Prepare an HTML file containing Google search results (refer to my scraping tutorial for guidance).
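Before running the full script, a one-off call like this sketch (assuming GEMINI_API_KEY is set in your environment) confirms the package and key work:

import os

from google import genai

# Smoke test: a trivial request should return text if the key is valid.
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
reply = client.models.generate_content(
    model="gemini-2.5-pro-preview-06-05",
    contents="Reply with the single word: ok",
)
print(reply.text)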
Running the Script
- Replace "YOUR_API_KEY_HERE" with your actual API key.
- Run the script and enter the name of your HTML file (e.g., search_results.html).
- The script will generate:
  - <base>_parsed_<timestamp>.json: the structured output
  - <base>_raw_<timestamp>.json: the raw AI response
Output
- Parsed Results JSON: Structured data for all detected result types.
- Raw Response JSON: Gemini’s unprocessed response, useful for debugging.
- Console Summary: A quick overview of the results, including query metadata and result type counts.