How to Parse Google Search Results: A Complete Guide with AI-Powered Extraction
Watch the full tutorial on YouTube
Why Use the Gemini 2.5 Pro AI Model?
- Large Input Capacity: Gemini 2.5 Pro can process extensive HTML content, making it ideal for analyzing Google’s SERPs, which often include diverse elements like ads, knowledge panels, and local business listings.
- Advanced Contextual Understanding: Its state-of-the-art natural language processing (NLP) capabilities allow it to interpret HTML structures and categorize result types (e.g., organic results, videos, recipes) with precision.
- Structured JSON Output: Gemini 2.5 Pro can generate clean, machine-readable JSON based on detailed prompts, ensuring the extracted data is ready for analysis or integration into other tools.
- Robust Handling of Variability: It adapts to inconsistencies in HTML structure, reducing the need for manual preprocessing or fragile parsing rules.
- Efficiency and Scalability: The model’s API is optimized for processing large datasets, making it suitable for projects requiring bulk SERP analysis.
The Code: Parsing Google Search Results with Gemini
import json
import os
from datetime import datetime

from google import genai

# Configure Gemini API (replace the placeholder with your own key,
# or load it from an environment variable as discussed below)
client = genai.Client(api_key="YOUR_API_KEY_HERE")


def read_html_file():
    """Read HTML file from user input"""
    filename = input("Enter the HTML file name (with extension, e.g., 'search_results.html'): ").strip()
    try:
        with open(filename, 'r', encoding='utf-8') as file:
            contents = file.read()
        print(f"✓ File '{filename}' read successfully!")
        print(f"File size: {len(contents)} characters")
        return contents, filename
    except FileNotFoundError:
        print(f"❌ Error: The file '{filename}' does not exist.")
        return None, None
    except Exception as e:
        print(f"❌ An error occurred: {e}")
        return None, None


def parse_search_results(html_content):
    """Parse Google search results using Gemini and return structured JSON"""
    prompt = f"""
You are an expert Google search results parser. Parse the HTML content and return ONLY a valid JSON object.

IMPORTANT: Your response must start with {{ and end with }} - NO other text, explanation, or formatting.

Extract search results from the HTML into this exact JSON structure:

RESULT TYPES TO IDENTIFY AND EXTRACT:
1. **Organic Results** - Regular web search results
2. **Ads/Sponsored** - Paid advertisements
3. **Featured Snippets** - Answer boxes at top
4. **Knowledge Panels** - Info boxes (usually right side)
5. **Local Business Results** - Maps/local listings with ratings, hours, phone
6. **YouTube Videos** - Video results from YouTube
7. **Images** - Image search results
8. **News** - News articles with date/source
9. **Shopping** - Product listings with prices
10. **Recipes** - Recipe cards with ratings, cook time
11. **People Also Ask** - Expandable question boxes
12. **Related Searches** - Suggested searches at bottom
13. **Twitter/Social** - Social media posts
14. **Academic** - Scholar/research papers
15. **Apps** - Mobile app results
16. **Events** - Event listings with dates
17. **Jobs** - Job postings
18. **Flights** - Flight information
19. **Hotels** - Hotel listings
20. **Weather** - Weather information
21. **Forum Results** - Discussion threads or posts from forums (e.g., Reddit, Quora, Stack Overflow)

Return ONLY valid JSON in this exact structure (no markdown, no explanation):
{{
  "search_metadata": {{
    "query": "extracted search query",
    "total_results": "estimated number if shown",
    "search_time": "time taken if shown",
    "page_title": "page title"
  }},
  "featured_snippets": [
    {{
      "rank": 1,
      "title": "snippet title",
      "content": "snippet content",
      "url": "source URL",
      "domain": "domain.com",
      "type": "featured_snippet"
    }}
  ],
  "knowledge_panel": {{
    "title": "panel title",
    "description": "description",
    "facts": {{}},
    "images": [],
    "source": "source",
    "type": "knowledge_panel"
  }},
  "organic_results": [
    {{
      "rank": 1,
      "title": "page title",
      "url": "full URL",
      "snippet": "description text",
      "domain": "domain.com",
      "breadcrumbs": "breadcrumb path if available",
      "date": "published date if available",
      "type": "organic"
    }}
  ],
  "local_business": [
    {{
      "rank": 1,
      "name": "business name",
      "rating": "4.5",
      "reviews_count": "123",
      "address": "full address",
      "phone": "phone number",
      "hours": "operating hours",
      "website": "website URL",
      "type": "local_business"
    }}
  ],
  "videos": [
    {{
      "rank": 1,
      "title": "video title",
      "url": "video URL",
      "thumbnail": "thumbnail URL",
      "duration": "video length",
      "views": "view count",
      "channel": "channel name",
      "upload_date": "upload date",
      "platform": "YouTube/Vimeo/etc",
      "type": "video"
    }}
  ],
  "images": [
    {{
      "rank": 1,
      "title": "image title",
      "url": "image URL",
      "source_url": "source page URL",
      "source_domain": "domain.com",
      "dimensions": "width x height",
      "type": "image"
    }}
  ],
  "news": [
    {{
      "rank": 1,
      "title": "news title",
      "url": "article URL",
      "snippet": "article snippet",
      "source": "news source",
      "date": "publish date",
      "thumbnail": "thumbnail URL if available",
      "type": "news"
    }}
  ],
  "shopping": [
    {{
      "rank": 1,
      "title": "product name",
      "price": "price",
      "rating": "rating if available",
      "reviews": "review count",
      "store": "store name",
      "url": "product URL",
      "image": "product image URL",
      "type": "shopping"
    }}
  ],
  "recipes": [
    {{
      "rank": 1,
      "title": "recipe name",
      "rating": "rating",
      "cook_time": "cooking time",
      "ingredients_count": "number of ingredients",
      "url": "recipe URL",
      "image": "recipe image",
      "source": "recipe source",
      "type": "recipe"
    }}
  ],
  "people_also_ask": [
    {{
      "question": "question text",
      "answer": "answer text if expanded",
      "source_url": "source URL",
      "type": "people_also_ask"
    }}
  ],
  "ads": [
    {{
      "rank": 1,
      "title": "ad title",
      "url": "ad URL",
      "snippet": "ad description",
      "domain": "advertiser domain",
      "position": "top/bottom/side",
      "type": "advertisement"
    }}
  ],
  "related_searches": [
    {{
      "query": "related search query",
      "type": "related_search"
    }}
  ],
  "social_media": [
    {{
      "rank": 1,
      "title": "post title/content",
      "url": "post URL",
      "platform": "Twitter/Facebook/etc",
      "author": "author name",
      "date": "post date",
      "type": "social_media"
    }}
  ],
  "apps": [
    {{
      "rank": 1,
      "name": "app name",
      "rating": "app rating",
      "downloads": "download count",
      "price": "app price",
      "platform": "iOS/Android",
      "url": "app store URL",
      "type": "app"
    }}
  ],
  "events": [
    {{
      "rank": 1,
      "title": "event name",
      "date": "event date",
      "location": "event location",
      "url": "event URL",
      "type": "event"
    }}
  ],
  "jobs": [
    {{
      "rank": 1,
      "title": "job title",
      "company": "company name",
      "location": "job location",
      "salary": "salary if available",
      "url": "job URL",
      "type": "job"
    }}
  ],
  "weather": {{
    "location": "weather location",
    "current_temp": "current temperature",
    "condition": "weather condition",
    "forecast": [],
    "type": "weather"
  }},
  "flights": [
    {{
      "airline": "airline name",
      "price": "flight price",
      "duration": "flight duration",
      "departure": "departure info",
      "arrival": "arrival info",
      "type": "flight"
    }}
  ],
  "forum_results": [
    {{
      "rank": 1,
      "title": "forum thread title",
      "url": "thread URL",
      "snippet": "thread preview text",
      "domain": "forum domain (e.g., reddit.com)",
      "author": "post author if available",
      "date": "post date if available",
      "replies_count": "number of replies if available",
      "type": "forum"
    }}
  ]
}}

CRITICAL PARSING RULES:
1. Return ONLY the JSON object, no other text
2. If a section has no results, use empty array [] or empty object {{}}
3. Always include "rank" starting from 1 for ordered results
4. Extract domain from URLs (e.g., "apple.com" from "https://www.apple.com/page")
5. Clean all text (remove extra spaces, newlines, HTML entities)
6. For ratings, extract numbers (e.g., "4.5" from "4.5 stars" or "★★★★☆")
7. For dates, extract in readable format
8. For prices, include currency symbol if available
9. Always identify the correct result type based on content and context
10. Maintain ranking order as they appear on the page (top to bottom)

HTML Content to analyze:
{html_content}
"""

    try:
        print("🔄 Parsing HTML content with Gemini AI...")

        # Use the correct client API format
        response = client.models.generate_content(
            model="gemini-2.5-pro-preview-06-05",
            contents=prompt
        )

        # Get the response text
        raw_response = response.text.strip()
        print("✓ Gemini API response received!")
        print(f"Response length: {len(raw_response)} characters")

        # Clean the response - remove markdown formatting if present
        cleaned_response = raw_response
        if cleaned_response.startswith('```json'):
            cleaned_response = cleaned_response[7:]
        if cleaned_response.endswith('```'):
            cleaned_response = cleaned_response[:-3]
        cleaned_response = cleaned_response.strip()

        # Show first 200 chars of response for debugging
        print(f"First 200 chars: {cleaned_response[:200]}...")

        # Try to parse as JSON
        try:
            parsed_data = json.loads(cleaned_response)
            print("✓ JSON parsing successful!")
            return parsed_data, raw_response
        except json.JSONDecodeError as e:
            print(f"❌ JSON parsing failed: {e}")
            print("Attempting to fix common JSON issues...")

            # Try to extract JSON from response if it's embedded in text
            json_start = cleaned_response.find('{')
            json_end = cleaned_response.rfind('}') + 1
            if json_start != -1 and json_end > json_start:
                json_part = cleaned_response[json_start:json_end]
                try:
                    parsed_data = json.loads(json_part)
                    print("✓ JSON parsing successful after extraction!")
                    return parsed_data, raw_response
                except json.JSONDecodeError:
                    print("❌ Still couldn't parse JSON after extraction")

            print("Raw response will be saved for manual inspection.")
            return None, raw_response

    except Exception as e:
        print(f"❌ Error calling Gemini API: {e}")
        return None, str(e)


def save_json_files(parsed_data, raw_response, original_filename):
    """Save both parsed JSON and raw response to files"""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    base_name = os.path.splitext(original_filename)[0]

    # Save raw response
    raw_filename = f"{base_name}_raw_{timestamp}.json"
    try:
        with open(raw_filename, 'w', encoding='utf-8') as f:
            if raw_response:
                # Try to format as JSON if possible, otherwise save as text
                try:
                    formatted_raw = json.dumps(json.loads(raw_response), indent=2, ensure_ascii=False)
                    f.write(formatted_raw)
                except json.JSONDecodeError:
                    f.write(raw_response)
            else:
                f.write("No raw response available")
        print(f"✓ Raw response saved to: {raw_filename}")
    except Exception as e:
        print(f"❌ Error saving raw file: {e}")

    # Save parsed JSON
    if parsed_data:
        parsed_filename = f"{base_name}_parsed_{timestamp}.json"
        try:
            with open(parsed_filename, 'w', encoding='utf-8') as f:
                json.dump(parsed_data, f, indent=2, ensure_ascii=False)
            print(f"✓ Parsed data saved to: {parsed_filename}")
            return parsed_filename, raw_filename
        except Exception as e:
            print(f"❌ Error saving parsed file: {e}")

    return None, raw_filename


def display_sample_results(parsed_data):
    """Display sample results from the parsed data"""
    if not parsed_data:
        print("❌ No parsed data to display")
        return

    print("\n" + "=" * 60)
    print("📊 SAMPLE RESULTS SUMMARY")
    print("=" * 60)

    # Search metadata
    if 'search_metadata' in parsed_data:
        metadata = parsed_data['search_metadata']
        print(f"\n🔍 Search Query: {metadata.get('query', 'N/A')}")
        print(f"📈 Total Results: {metadata.get('total_results', 'N/A')}")
        print(f"⏱️ Search Time: {metadata.get('search_time', 'N/A')}")

    # Count results by type
    result_counts = {}
    for key, value in parsed_data.items():
        if key != 'search_metadata' and isinstance(value, list):
            result_counts[key] = len(value)
        elif key != 'search_metadata' and isinstance(value, dict) and value:
            result_counts[key] = 1

    print(f"\n📋 RESULTS BREAKDOWN:")
    for result_type, count in result_counts.items():
        if count > 0:
            print(f"  • {result_type.replace('_', ' ').title()}: {count}")

    # Show sample organic results
    if 'organic_results' in parsed_data and parsed_data['organic_results']:
        print(f"\n🌐 SAMPLE ORGANIC RESULTS (Top 3):")
        for i, result in enumerate(parsed_data['organic_results'][:3], 1):
            print(f"\n  {i}. {result.get('title', 'No title')}")
            print(f"     URL: {result.get('url', 'No URL')}")
            print(f"     Domain: {result.get('domain', 'No domain')}")
            snippet = result.get('snippet', 'No snippet')
            if len(snippet) > 100:
                snippet = snippet[:100] + "..."
            print(f"     Snippet: {snippet}")

    # Show featured snippets
    if 'featured_snippets' in parsed_data and parsed_data['featured_snippets']:
        print(f"\n⭐ FEATURED SNIPPETS:")
        for snippet in parsed_data['featured_snippets']:
            print(f"  Title: {snippet.get('title', 'No title')}")
            content = snippet.get('content', 'No content')
            if len(content) > 150:
                content = content[:150] + "..."
            print(f"  Content: {content}")
            print(f"  Source: {snippet.get('domain', 'No domain')}")

    # Show local business results
    if 'local_business' in parsed_data and parsed_data['local_business']:
        print(f"\n🏪 LOCAL BUSINESS RESULTS:")
        for business in parsed_data['local_business'][:2]:
            print(f"  • {business.get('name', 'No name')}")
            print(f"    Rating: {business.get('rating', 'N/A')} ({business.get('reviews_count', 'N/A')} reviews)")
            print(f"    Address: {business.get('address', 'No address')}")

    # Show people also ask
    if 'people_also_ask' in parsed_data and parsed_data['people_also_ask']:
        print(f"\n❓ PEOPLE ALSO ASK (Top 3):")
        for qa in parsed_data['people_also_ask'][:3]:
            print(f"  • {qa.get('question', 'No question')}")

    print("\n" + "=" * 60)


def main():
    """Main function to orchestrate the parsing process"""
    print("🚀 Google Search Results Parser")
    print("=" * 40)

    # Read HTML file
    html_content, filename = read_html_file()
    if not html_content:
        return

    # Parse with Gemini
    parsed_data, raw_response = parse_search_results(html_content)

    # Save JSON files
    parsed_file, raw_file = save_json_files(parsed_data, raw_response, filename)

    # Display sample results
    if parsed_data:
        display_sample_results(parsed_data)
    else:
        print("❌ Could not parse results. Check the raw response file for details.")

    print(f"\n📁 FILES CREATED:")
    print(f"  • Raw response: {raw_file}")
    if parsed_file:
        print(f"  • Parsed JSON: {parsed_file}")

    print("\n✅ Process completed!")


if __name__ == "__main__":
    main()
Obtaining and Using the Gemini API Key
- What It Does: The code requires a Gemini API key to authenticate requests to the Gemini 2.5 Pro model.
- Code Snippet:
client = genai.Client(api_key="YOUR_API_KEY_HERE")
- Why It Matters: The API key is your gateway to accessing Gemini’s powerful parsing capabilities; without it, you can’t connect to the model. You can easily generate an API key by visiting Gemini AI Studio, signing in with your Google account, and following the instructions to create a key. Be sure to keep your API key secure and avoid hardcoding it in production code; use environment variables (e.g., os.environ["GEMINI_API_KEY"]) for safety, as sketched below.
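A minimal sketch of that environment-variable approach, assuming you have exported GEMINI_API_KEY in your shell before running the script:

import os

from google import genai

# Read the key from the environment instead of hardcoding it.
# Set it beforehand, e.g.: export GEMINI_API_KEY="your-key"
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])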
Crafting a High-Quality Prompt for Better Results
- What It Does: The prompt in the parse_search_results function is a detailed instruction set that tells Gemini 2.5 Pro how to parse the HTML and structure the output as JSON.
- Code Snippet:
prompt = f"""
You are an expert Google search results parser...
Return ONLY valid JSON in this exact structure...
{{
  "search_metadata": {{...}},
  "featured_snippets": [...],
  ...
}}
HTML Content to analyze:
{html_content}
"""
- Why It Matters: The prompt is the most critical part of the code because a better prompt yields better results. It specifies 21 result types (e.g., organic, ads, videos) and enforces rules like cleaning text and maintaining ranking order. Users can modify the prompt to extract additional fields (e.g., specific metadata) or adapt it for other HTML parsing tasks, making the tool highly customizable (see the sketch after this list).
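For example, here is a hedged sketch of one way to splice an extra rule into the prompt string inside parse_search_results before the API call. The "sitelinks" field is hypothetical and not part of the original schema:

# Hypothetical customization: inject an extra extraction rule into the prompt.
# The "sitelinks" field is illustrative only, not part of the original schema.
custom_rule = 'Also extract a "sitelinks" array for each organic result when sitelinks are present.'
prompt = prompt.replace(
    "HTML Content to analyze:",
    custom_rule + "\n\nHTML Content to analyze:",
)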
Requiring Pre-Scraped HTML Input
- What It Does: The script expects an HTML file containing Google search results, which it reads and passes to Gemini for parsing.
- Code Snippet:
def read_html_file():
    """Read HTML file from user input"""
    filename = input("Enter the HTML file name (with extension, e.g., 'search_results.html'): ").strip()
    try:
        with open(filename, 'r', encoding='utf-8') as file:
            contents = file.read()
        print(f"✓ File '{filename}' read successfully!")
        print(f"File size: {len(contents)} characters")
        return contents, filename
    except FileNotFoundError:
        print(f"❌ Error: The file '{filename}' does not exist.")
        return None, None
    except Exception as e:
        print(f"❌ An error occurred: {e}")
        return None, None
- Why It Matters: This tutorial focuses exclusively on parsing HTML with Gemini 2.5 Pro, not on scraping Google search results. Users must provide pre-scraped HTML, which they can obtain with web scraping tools such as Python’s requests, BeautifulSoup, or Selenium (a minimal fetch sketch follows this list). For detailed scraping instructions, refer to my tutorial: How to Scrape Google Search Results with Python.
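As a quick illustration only (not a robust scraper), a minimal requests-based fetch might look like the sketch below. Google frequently blocks or captchas plain HTTP clients, so treat this as a starting point and see the linked tutorial for a production approach; the query URL and User-Agent are placeholders:

import requests

# Illustrative only: fetch one results page and save it for the parser.
# Google may block or captcha plain requests; URL and headers are placeholders.
url = "https://www.google.com/search?q=best+pizza+in+new+york"
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()

with open("search_results.html", "w", encoding="utf-8") as f:
    f.write(response.text)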
Generating Raw and Parsed JSON Outputs
- What It Does: The script produces two JSON files: <base>_parsed_<timestamp>.json (the structured output) and <base>_raw_<timestamp>.json (the raw AI response for debugging).
- Code Snippet:
# Save raw response
raw_filename = f"{base_name}_raw_{timestamp}.json"
try:
    with open(raw_filename, 'w', encoding='utf-8') as f:
        if raw_response:
            # Try to format as JSON if possible, otherwise save as text
            try:
                formatted_raw = json.dumps(json.loads(raw_response), indent=2, ensure_ascii=False)
                f.write(formatted_raw)
            except json.JSONDecodeError:
                f.write(raw_response)
        else:
            f.write("No raw response available")
    print(f"✓ Raw response saved to: {raw_filename}")
except Exception as e:
    print(f"❌ Error saving raw file: {e}")

# Save parsed JSON
if parsed_data:
    parsed_filename = f"{base_name}_parsed_{timestamp}.json"
    try:
        with open(parsed_filename, 'w', encoding='utf-8') as f:
            json.dump(parsed_data, f, indent=2, ensure_ascii=False)
        print(f"✓ Parsed data saved to: {parsed_filename}")
        return parsed_filename, raw_filename
    except Exception as e:
        print(f"❌ Error saving parsed file: {e}")

return None, raw_filename
- Why It Matters: The parsed JSON file contains the structured data (e.g., organic results, featured snippets) in the format specified by the prompt, ready for analysis or integration (see the loading sketch below). The raw JSON file captures Gemini’s unprocessed response, including any errors, which is invaluable for debugging if the parsing fails or produces unexpected results. This dual-output approach ensures transparency and reliability.
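Once the parsed file exists, downstream use is a plain json.load. A minimal sketch, where the filename is illustrative since the script timestamps each output:

import json

# Filename is illustrative; the script timestamps each output file.
with open("search_results_parsed_20250101_120000.json", encoding="utf-8") as f:
    data = json.load(f)

# Iterate over one section of the structured output.
for result in data.get("organic_results", []):
    print(result.get("rank"), result.get("title"), result.get("url"))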
Here’s a screenshot of what the JSON result looks like:
Here’s a recap of everything from start to finish:
Prerequisites
- Install the google-genai package (the script imports it via from google import genai): pip install google-genai
- Get your Gemini API key from Gemini AI Studio (a quick sanity check follows this list).
- Prepare an HTML file containing Google search results (refer to my scraping tutorial for guidance).
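Before running the full script, a one-off call like this sketch (assuming GEMINI_API_KEY is set in your environment) confirms the package and key work:

import os

from google import genai

# Smoke test: a trivial request should return text if the key is valid.
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
reply = client.models.generate_content(
    model="gemini-2.5-pro-preview-06-05",
    contents="Reply with the single word: ok",
)
print(reply.text)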
Running the Script
- Replace "YOUR_API_KEY_HERE" with your actual API key.
- Run the script and enter the name of your HTML file (e.g., search_results.html).
- The script will generate:
  - <base>_parsed_<timestamp>.json: the structured output
  - <base>_raw_<timestamp>.json: the raw AI response
Output
- Parsed Results JSON: Structured data for all detected result types.
- Raw Response JSON: Gemini’s unprocessed response, useful for debugging.
- Console Summary: A quick overview of the results, including query metadata and result type counts.