{"id":3110,"date":"2024-12-23T15:25:03","date_gmt":"2024-12-23T15:25:03","guid":{"rendered":"https:\/\/rayobyte.com\/community\/?post_type=scraping_project&#038;p=3110"},"modified":"2024-12-23T15:32:48","modified_gmt":"2024-12-23T15:32:48","slug":"how-to-create-a-yahoo-scraper-in-python-for-search-data","status":"publish","type":"scraping_project","link":"https:\/\/rayobyte.com\/community\/scraping-project\/how-to-create-a-yahoo-scraper-in-python-for-search-data\/","title":{"rendered":"How to Create a Yahoo Scraper in Python for Search Data"},"content":{"rendered":"<h1>Learn how to build a Yahoo scraper in Python to extract search data, including titles and descriptions. Step-by-step guide with source code.<\/h1>\n<p>Yahoo&#8217;s search engine offers unique insights and opportunities for data extraction. In this tutorial, we&#8217;ll guide you through building a Yahoo scraper using Python. You&#8217;ll learn how to scrape search results, including titles, links, and descriptions, to gather valuable data from Yahoo&#8217;s search engine.<\/p>\n<h2>Table of Contents<\/h2>\n<p><a href=\"#introduction\">Introduction<\/a><br \/><a href=\"#prerequisites\">Prerequisites<\/a><br \/><a href=\"#step1\">Step 1: Setting Up the Yahoo Scraper<\/a><br \/><a href=\"#step2\">Step 2: Parsing the HTML Content<\/a><br \/><a href=\"#step3\">Step 3: Saving Data to CSV<\/a><br \/><a href=\"#step4\">Step 4: Running the Scraper<\/a><br \/><a href=\"#output\">Expected Output<\/a><br \/><a href=\"#best-practices\">Best Practices for Scraping<\/a><br \/><a href=\"#conclusion\">Conclusion<\/a><\/p>\n<h2 id=\"introduction\">Introduction<\/h2>\n<p>Web scraping has become an essential skill for extracting valuable information from the web. Whether you&#8217;re collecting data for research, market analysis, or building a search engine aggregator, web scraping allows you to automate data extraction efficiently. 
In this tutorial, we\u2019ll walk you through creating a<strong> Yahoo scraper in Python<\/strong>. Using Python\u2019s robust libraries like <strong>requests<\/strong> and <strong>BeautifulSoup<\/strong>, we\u2019ll demonstrate how to fetch and parse search results from Yahoo.<\/p>\n<h2 id=\"prerequisites\">Prerequisites<\/h2>\n<p>Before diving in, ensure you have the following:<\/p>\n<ul>\n<li>Python: Install Python 3.6 or higher.<\/li>\n<li>Libraries: Install the required Python libraries by running the command below:<\/li>\n<\/ul>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">pip install requests beautifulsoup4 pandas<\/pre>\n<ul>\n<li>Text Editor or IDE: Use your preferred development environment (e.g., VSCode, PyCharm, or Jupyter Notebook).<\/li>\n<\/ul>\n<h2 id=\"step1\">Step 1: Setting Up the Yahoo Scraper<\/h2>\n<p>First, let\u2019s create a function to fetch search results from Yahoo. We&#8217;ll use the <code>requests<\/code> library to send an HTTP <strong>GET<\/strong> request and retrieve the HTML content of the search results page.<\/p>\n<p>Code:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import requests\nfrom bs4 import BeautifulSoup\nimport pandas as pd\nimport re\n\n\n# Function to fetch Yahoo search results\ndef fetch_yahoo_search_results(query, start=0):\n    headers = {\n        'User-Agent': 'Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/114.0.0.0 Safari\/537.36'\n    }\n    base_url = \"https:\/\/search.yahoo.com\/search\"\n    params = {\n        'p': query,  # The search query\n        'b': start + 1  # Starting position of the results (1-based index)\n    }\n\n    response = requests.get(base_url, headers=headers, params=params)\n\n    if response.status_code == 200:\n        return response.text\n    else:\n        print(f\"Failed to fetch results. 
HTTP Status Code: {response.status_code}\")\n        return None<\/pre>\n<h2 id=\"step2\">Step 2: Parsing the HTML Content<\/h2>\n<p>Next, we\u2019ll parse the HTML content using <code>BeautifulSoup<\/code> to extract search results. The titles, links, and descriptions are typically located within specific HTML tags or classes.<\/p>\n<p>Code:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\"># Function to parse the HTML content\ndef parse_search_results(html_content):\n    soup = BeautifulSoup(html_content, 'html.parser')\n    search_results = []\n\n    for result in soup.select('.Sr'):  # Yahoo search result container class\n        title_tag = result.select_one('h3')\n        link_tag = title_tag.a if title_tag else None\n        description_tag = result.select_one('p')\n        title = link_tag['aria-label'] if link_tag else 'N\/A'\n        link = link_tag['href'] if link_tag else 'N\/A'\n        description = description_tag.text if description_tag else 'N\/A'\n\n        # Remove &lt;b&gt; and &lt;\/b&gt; tags from the title\n        title = re.sub(r'&lt;\/?b&gt;', '', title)\n\n        search_results.append({\n            'Title': title,\n            'Link': link,\n            'Description': description\n        })\n\n    return search_results<\/pre>\n<h2 id=\"step3\">Step 3: Saving Data to CSV<\/h2>\n<p>After extracting the data, we&#8217;ll save it to a CSV file using pandas for easy analysis and sharing.<\/p>\n<p>Code:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\"># Function to save data to a CSV file\ndef save_to_csv(data, filename=\"yahoo_search_results.csv\"):\n    df = pd.DataFrame(data)\n    df.to_csv(filename, index=False)\n    print(f\"Data saved to {filename}\")<\/pre>\n<h2 id=\"step4\">Step 4: Running the Scraper<\/h2>\n<p>Finally, we\u2019ll combine all the functions to fetch multiple pages of search results, parse them, and save the data.<\/p>\n<p>Code:<\/p>\n<pre class=\"EnlighterJSRAW\" 
data-enlighter-language=\"generic\">if __name__ == \"__main__\":\n    query = \"Python\"\n    num_pages = 5  # Number of pages to fetch\n\n    all_search_results = []\n\n    for page in range(num_pages):\n        start = page * 10  # Assuming 10 results per page\n        print(f\"Fetching search results for page {page + 1}...\")\n        html_content = fetch_yahoo_search_results(query, start)\n\n        if html_content:\n            print(\"Parsing search results...\")\n            search_results = parse_search_results(html_content)\n            all_search_results.extend(search_results)\n        else:\n            print(\"Failed to scrape Yahoo search results.\")\n            break\n\n    print(\"Saving results to CSV...\")\n    save_to_csv(all_search_results)\n\n    print(\"Yahoo scraping completed successfully!\")<\/pre>\n<h2 id=\"output\">Expected Output<\/h2>\n<p>Once you run the scraper, a CSV file named <code>yahoo_search_results.csv<\/code> will be created in your working directory. Since we set <code>num_pages = 5<\/code> and request 10 results per page, you will get up to 50 results in total (the exact count may vary, since Yahoo does not always return exactly 10 organic results per page). 
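<\/p>\n<p>As a quick sanity check, you can load the CSV back with pandas. The helper below is a small sketch of our own (the name <code>summarize_results<\/code> is illustrative, not part of the scraper itself):<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import os\n\nimport pandas as pd\n\n\n# Summarize the scraped CSV, or report that it is missing\ndef summarize_results(path='yahoo_search_results.csv'):\n    if not os.path.exists(path):\n        return path + ' not found; run the scraper first'\n    df = pd.read_csv(path)\n    return str(len(df)) + ' rows with columns ' + str(list(df.columns))\n\n\nprint(summarize_results())<\/pre>\n<p>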
Here\u2019s an example of what the contents of the CSV file might look like:<br \/><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-3111\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-22-190322.png\" alt=\"yahoo search scraper results\" width=\"1645\" height=\"971\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-22-190322.png 1645w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-22-190322-300x177.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-22-190322-1024x604.png 1024w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-22-190322-768x453.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-22-190322-1536x907.png 1536w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/12\/Screenshot-2024-12-22-190322-624x368.png 624w\" sizes=\"auto, (max-width: 1645px) 100vw, 1645px\" \/><\/p>\n<h2 id=\"best-practices\">Best Practices for Scraping<\/h2>\n<p><strong>1. Add Delays Between Requests<\/strong><\/p>\n<p>Websites often monitor the frequency of requests to prevent scraping. Adding small delays between requests can mimic human behavior and reduce the risk of getting blocked.<\/p>\n<p><strong>2. Use Proxy Rotation<\/strong><\/p>\n<p>For scraping a large number of search results, especially across many queries or pages, proxy rotation is essential. It ensures your requests originate from different IPs, avoiding detection and blocking. 
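<\/p>\n<p>In practice, \u201crotation\u201d means sending each request through a different proxy from a pool. Here is a minimal sketch using Python\u2019s <code>itertools.cycle<\/code> (the proxy addresses below are placeholders, not real endpoints):<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import itertools\n\n# Placeholder proxy URLs; substitute your own proxy endpoints\nPROXY_POOL = [\n    'http:\/\/user:pass@proxy1.example.com:8000',\n    'http:\/\/user:pass@proxy2.example.com:8000',\n    'http:\/\/user:pass@proxy3.example.com:8000',\n]\nproxy_cycle = itertools.cycle(PROXY_POOL)\n\n\n# Return a requests-style proxies dict, rotating on each call\ndef next_proxies():\n    proxy = next(proxy_cycle)\n    return {'http': proxy, 'https': proxy}<\/pre>\n<p>Passing the result of <code>next_proxies()<\/code> as the <code>proxies<\/code> argument of <code>requests.get<\/code> routes successive page fetches through different IPs.<\/p>\n<p>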
<a href=\"https:\/\/rayobyte.com\/products\/residential-proxies\/\">Rayobyte<\/a> offers reliable proxy services.<\/p>\n<p>Here is a proxy rotation setup with <a href=\"https:\/\/rayobyte.com\/products\/residential-proxies\/\">Rayobyte<\/a>.<\/p>\n<p>To integrate proxies into the setup, we add them to the <code>fetch_yahoo_search_results<\/code> function. This version uses the <code>python-dotenv<\/code> package (install it with <code>pip install python-dotenv<\/code>) to load the proxy URL from a <code>.env<\/code> file:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import requests\nfrom bs4 import BeautifulSoup\nimport pandas as pd\nimport re\nfrom dotenv import load_dotenv\nimport os\n\n# Load environment variables from .env file\nload_dotenv()\n\n# Get proxy from environment variables\nproxy = os.getenv('PROXY')\n\n# Function to fetch Yahoo search results\ndef fetch_yahoo_search_results(query, start=0):\n    headers = {\n        'User-Agent': 'Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/114.0.0.0 Safari\/537.36'\n    }\n    base_url = \"https:\/\/search.yahoo.com\/search\"\n    params = {\n        'p': query,  # The search query\n        'b': start + 1  # Starting position of the results (1-based index)\n    }\n\n    proxies = {\n        'http': proxy,\n        'https': proxy\n    }\n\n    response = requests.get(base_url, headers=headers, params=params, proxies=proxies)\n\n    if response.status_code == 200:\n        return response.text\n    else:\n        print(f\"Failed to fetch results. HTTP Status Code: {response.status_code}\")\n        return None<\/pre>\n<p>This reads the proxy URL stored in the <code>.env<\/code> file in your project directory:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">PROXY=http:\/\/your-proxy-url:port<\/pre>\n<p><strong>3. Respect Website Terms of Service<\/strong><\/p>\n<p>Before scraping, always check the website\u2019s Terms of Service. Use the data responsibly and ensure compliance with local laws and regulations.<\/p>\n<h2 id=\"conclusion\">Conclusion<\/h2>\n<p>Congratulations! 
You\u2019ve successfully built a Yahoo scraper in Python. This script allows you to fetch and parse Yahoo search results, including titles, links, and descriptions, and save them to a CSV file for further analysis. With minor modifications, this scraper can be adapted to other use cases, such as fetching financial data or building search engine aggregators.<\/p>\n<p>Feel free to experiment with the code and customize it for your needs. If you encounter any issues or have questions while following this tutorial, feel free to leave a comment below. I\u2019d be happy to help you troubleshoot and provide additional guidance!<\/p>\n<p>Happy scraping!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Learn how to build a Yahoo scraper in Python to extract search data, including titles and descriptions. Step-by-step guide with source code. Yahoo&#8217;s search engine&hellip;<\/p>\n","protected":false},"author":25,"featured_media":3113,"comment_status":"open","ping_status":"closed","template":"","meta":{"rank_math_lock_modified_date":false},"categories":[],"class_list":["post-3110","scraping_project","type-scraping_project","status-publish","has-post-thumbnail","hentry"],"_links":{"self":[{"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/scraping_project\/3110","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/scraping_project"}],"about":[{"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/types\/scraping_project"}],"author":[{"embeddable":true,"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/users\/25"}],"replies":[{"embeddable":true,"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/comments?post=3110"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/media\/3113"}],"wp:attachment":[{"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/media?parent=3110"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":
"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/categories?post=3110"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}