How to Create a Yahoo Scraper in Python for Search Data

Learn how to build a Yahoo scraper in Python to extract search data, including titles and descriptions. Step-by-step guide with source code.

Yahoo’s search engine offers unique insights and opportunities for data extraction. In this tutorial, we’ll guide you through building a Yahoo scraper using Python. You’ll learn how to scrape search results, including titles, links, and descriptions, to gather valuable data from Yahoo’s search engine.

Table of Contents

Introduction
Prerequisites
Step 1: Setting Up the Yahoo Scraper
Step 2: Parsing the HTML Content
Step 3: Saving Data to CSV
Step 4: Running the Scraper
Expected Output
Best Practices for Scraping
Conclusion

Introduction

Web scraping has become an essential skill for extracting valuable information from the web. Whether you’re collecting data for research, market analysis, or building a search engine aggregator, web scraping allows you to automate data extraction efficiently. In this tutorial, we’ll walk you through creating a Yahoo scraper in Python. Using Python’s robust libraries like requests and BeautifulSoup, we’ll demonstrate how to fetch and parse search results from Yahoo.

Prerequisites

Before diving in, ensure you have the following:

  • Python: Install Python 3.6 or higher.
  • Libraries: Install the required Python libraries by running the command below:
pip install requests beautifulsoup4 pandas
  • Text Editor or IDE: Use your preferred development environment (e.g., VSCode, PyCharm, or Jupyter Notebook).

Step 1: Setting Up the Yahoo Scraper

First, let’s create a function to fetch search results from Yahoo. We’ll use the requests library to send an HTTP GET request and retrieve the HTML content of the search results page.

Code:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re


# Function to fetch Yahoo search results
def fetch_yahoo_search_results(query, start=0):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
    }
    base_url = "https://search.yahoo.com/search"
    params = {
        'p': query,  # The search query
        'b': start + 1  # Starting position of the results (1-based index)
    }

    response = requests.get(base_url, headers=headers, params=params)

    if response.status_code == 200:
        return response.text
    else:
        print(f"Failed to fetch results. HTTP Status Code: {response.status_code}")
        return None
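One robustness tweak worth considering: the request above has no timeout and gives up on the first failure. A hedged variant (the retry counts and backoff values below are illustrative choices, not part of the original script) builds a `requests.Session` that retries transient errors automatically:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries=3, backoff=1.0):
    """Build a requests Session that retries transient failures."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff,  # waits ~1s, 2s, 4s between attempts
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

You could then call `session.get(base_url, headers=headers, params=params, timeout=10)` in place of `requests.get` inside `fetch_yahoo_search_results`.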

Step 2: Parsing the HTML Content

Next, we’ll parse the HTML content using BeautifulSoup to extract search results. The titles, links, and descriptions are typically located within specific HTML tags or classes.

Code:

# Function to parse the HTML content
def parse_search_results(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    search_results = []

    for result in soup.select('.Sr'):  # Yahoo search result container class
        title_tag = result.select_one('h3')
        link_tag = title_tag.a if title_tag else None
        description_tag = result.select_one('p')
        title = link_tag['aria-label'] if link_tag else 'N/A'
        link = link_tag['href'] if link_tag else 'N/A'
        description = description_tag.text if description_tag else 'N/A'

        # Remove <b> and </b> tags from the title
        title = re.sub(r'</?b>', '', title)

        search_results.append({
            'Title': title,
            'Link': link,
            'Description': description
        })

    return search_results

Step 3: Saving Data to CSV

After extracting the data, we’ll save it to a CSV file using pandas for easy analysis and sharing.

Code:

# Function to save data to a CSV file
def save_to_csv(data, filename="yahoo_search_results.csv"):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")

Step 4: Running the Scraper

Finally, we’ll combine all the functions to fetch multiple pages of search results, parse them, and save the data.

Code:

if __name__ == "__main__":
    query = "Python"
    num_pages = 5  # Number of pages to fetch

    all_search_results = []

    for page in range(num_pages):
        start = page * 10  # Assuming 10 results per page
        print(f"Fetching search results for page {page + 1}...")
        html_content = fetch_yahoo_search_results(query, start)

        if html_content:
            print("Parsing search results...")
            search_results = parse_search_results(html_content)
            all_search_results.extend(search_results)
        else:
            print("Failed to scrape Yahoo search results.")
            break

    print("Saving results to CSV...")
    save_to_csv(all_search_results)

    print("Yahoo scraping completed successfully!")

Expected Output

Once you run the scraper, a CSV file named yahoo_search_results.csv will be created in your working directory. Since we set num_pages = 5 and Yahoo typically returns 10 results per page, you should get around 50 results in total (the exact count can vary). Here’s an example of what the contents of the CSV file might look like:

[Screenshot: sample contents of yahoo_search_results.csv]
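As a quick sanity check after a run, you can load the CSV back with pandas and confirm the three expected columns are present. The row below is a made-up placeholder, not real scraped data, and the filename is chosen so it won’t clobber your actual output:

```python
import pandas as pd

# Hypothetical row illustrating the CSV layout produced by save_to_csv
rows = [{"Title": "Welcome to Python.org",
         "Link": "https://www.python.org/",
         "Description": "The official home of the Python programming language."}]
df = pd.DataFrame(rows)
df.to_csv("yahoo_search_results_check.csv", index=False)

check = pd.read_csv("yahoo_search_results_check.csv")
print(list(check.columns))  # ['Title', 'Link', 'Description']
```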

Best Practices for Scraping

1. Add Delays Between Requests

Websites often monitor the frequency of requests to prevent scraping. Adding small delays between actions can mimic human behavior and reduce the risk of getting blocked.
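A minimal sketch of such a delay, using only the standard library (the 2–5 second range below is an assumption; tune it to your needs):

```python
import random
import time


def polite_sleep(min_s=2.0, max_s=5.0):
    """Sleep for a random interval to mimic human browsing cadence."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

In the main loop from Step 4, you would call `polite_sleep()` after each `fetch_yahoo_search_results` call, so consecutive pages aren’t requested back to back.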

2. Use Proxy Rotation

For scraping a large number of search results, especially across many pages or queries, proxy rotation is essential. It ensures your requests originate from different IPs, avoiding detection and blocking. Rayobyte offers reliable proxy services.

Proxy Rotation Setup with Rayobyte

To integrate proxies into the setup, we need to pass proxies to the fetch_yahoo_search_results function:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Get proxy from environment variables
proxy = os.getenv('PROXY')

# Function to fetch Yahoo search results
def fetch_yahoo_search_results(query, start=0):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
    }
    base_url = "https://search.yahoo.com/search"
    params = {
        'p': query,  # The search query
        'b': start + 1  # Starting position of the results (1-based index)
    }

    proxies = {
        'http': proxy,
        'https': proxy
    }

    response = requests.get(base_url, headers=headers, params=params, proxies=proxies)

    if response.status_code == 200:
        return response.text
    else:
        print(f"Failed to fetch results. HTTP Status Code: {response.status_code}")
        return None

This reads the proxy URL stored in the .env file:

PROXY=http://your-proxy-url:port
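The snippet above routes every request through a single proxy. If your provider gives you a pool of endpoints, one possible way to rotate through them is the helper below (the function name and the idea of supplying the pool as a list are illustrative assumptions, not Rayobyte specifics):

```python
import itertools


def make_proxy_rotator(proxy_urls):
    """Return a callable that cycles through a pool of proxy URLs,
    producing a requests-style proxies dict on each call."""
    pool = itertools.cycle(proxy_urls)

    def next_proxies():
        proxy = next(pool)
        return {"http": proxy, "https": proxy}

    return next_proxies


# Example: rotate between two hypothetical endpoints
rotate = make_proxy_rotator(["http://proxy-a:8000", "http://proxy-b:8000"])
print(rotate())  # {'http': 'http://proxy-a:8000', 'https': 'http://proxy-a:8000'}
```

Each call to `rotate()` yields the next proxy in the pool, so inside the fetch loop you can pass `proxies=rotate()` to `requests.get`.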

3. Respect Website Terms of Service

Before scraping, always check the website’s Terms of Service. Use the data responsibly and ensure compliance with local laws and regulations.
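Part of that compliance check can be automated: Python’s standard library ships a robots.txt parser. The rules below are an invented example for illustration; in practice you would fetch the site’s real file (e.g. https://search.yahoo.com/robots.txt) and check your target URLs against it:

```python
from urllib.robotparser import RobotFileParser


def is_allowed(url, rules, user_agent="*"):
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content for illustration only
rules = """User-agent: *
Disallow: /private/
"""
print(is_allowed("https://example.com/search?p=python", rules))  # True
print(is_allowed("https://example.com/private/page", rules))     # False
```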

Conclusion

Congratulations! You’ve successfully built a Yahoo scraper in Python. This script allows you to fetch and parse Yahoo search results, including titles, links, and descriptions, and save them to a CSV file for further analysis. With minor modifications, this scraper can be adapted to other use cases, such as fetching financial data or building search engine aggregators.

Feel free to experiment with the code and customize it for your needs. If you encounter any issues or have questions while following this tutorial, feel free to leave a comment below. I’d be happy to help you troubleshoot and provide additional guidance!

Happy scraping!
