Build a YouTube Scraper in Python to Extract Video Data

Download the full source code from GitHub

Introduction

YouTube can provide a wealth of information about trends and audience behavior, which can support content analysis such as tracking popular topics. This tutorial covers two approaches to scraping YouTube:

  1. Fetching data with the YouTube API: an official method to fetch structured data directly from YouTube, including video titles, descriptions, views, likes, comments, and more.
  2. Scraping without the API, using Selenium Stealth to evade bot detection and scrape data directly from the website.

By the end, you will know both strategies and can choose the one that better fits your project requirements.

Getting started with the YouTube API

Now, let's get to work:

Set Up a Google Cloud Project

To begin using the YouTube Data API, you have to create a project in the Google Cloud Console:

Step 1: Open Google Cloud Console

At the top, click Select a Project and select New Project.

Enter a name for your project and choose a location for it.

Click “Create”. After the project is created, you’ll see it in your available projects.

Step 2: Activate the YouTube Data API

Once you have created your project, you need to activate the YouTube Data API:

Navigate to APIs & Services > Library in Google Cloud Console.

Search for YouTube Data API v3 and click on it.

Click Enable to add it to your project.

Go to APIs & Services > Credentials to generate API credentials.

Click Create Credentials and select API key. Make a copy of the key, because it will be needed in API calls.
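Since the key is needed in every API call, one option is to keep it out of the script and read it from an environment variable. The sketch below assumes a variable named YOUTUBE_API_KEY, which is a hypothetical name chosen for this example and not something the API requires.

```python
import os


def load_api_key(env_var="YOUTUBE_API_KEY", environ=None):
    """Return the API key stored in an environment variable.

    YOUTUBE_API_KEY is a hypothetical variable name chosen for this sketch.
    """
    # Allow a plain dict to be passed in, which makes the function easy to test
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key
```

In the scripts later in this tutorial, you could then write API_KEY = load_api_key() instead of pasting the key into the source file.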

Step 3: Install Required Libraries

Now that you have your API key, you will need a Python library to fetch data from the API and process it.

Install the google-api-python-client library by running:

pip install google-api-python-client

With the library installed and your API key in hand, you can authenticate with the YouTube API and access video details. Now you can start building your YouTube scrapers!

Creating a YouTube Scraper Based on Keywords

Define your search parameters: we can use YouTube's search endpoint to find videos by keyword. This endpoint lets us filter the search results with various parameters, e.g.:

  • q: The search string.
  • maxResults: Maximum number of results to return in a single request (up to 50).
  • type: The kind of content to return in the results: videos, channels, or playlists.
  • relevanceLanguage: The language the results should be most relevant to, e.g. “en” for English.
  • location and locationRadius: Narrow the search results to a specific area, given as latitude/longitude coordinates plus a radius.

The next step is writing the keyword-based scraping code:

from googleapiclient.discovery import build
import csv

# Define your API Key here
API_KEY = 'erewrwer2h62JiHrpCCMGrewrwerwerwes'  # Replace with your actual API key
# Build the YouTube API service
youtube = build('youtube', 'v3', developerKey=API_KEY)

# Define the search function with location and language
def youtube_search(keyword, max_results=5, latitude=None, longitude=None, radius="50km", language="en"):
    # Prepare parameters for location search if latitude and longitude are provided
    search_params = {
        "part": "snippet",
        "q": keyword,
        "maxResults": max_results,
        "type": "video",
        "relevanceLanguage": language  # Specify the relevance language
    }
    # Add location parameters if latitude and longitude are provided
    if latitude is not None and longitude is not None:
        search_params["location"] = f"{latitude},{longitude}"
        search_params["locationRadius"] = radius

    # Call the search.list method to retrieve results matching the keyword, location, and language
    request = youtube.search().list(**search_params)
    response = request.execute()
    
    # List to store video details for CSV
    video_data = []


    # Print important video details
    for item in response.get('items', []):
        video_id = item['id']['videoId']
        snippet = item['snippet']
        
        # Extract the key data points for each result
        details = {
            "Title": snippet.get("title", "N/A"),
            "Channel Name": snippet.get("channelTitle", "N/A"),
            "Video URL": f"https://www.youtube.com/watch?v={video_id}",
            "Description": snippet.get("description", "N/A"),
            "Publish Date": snippet.get("publishedAt", "N/A"),
            "Channel ID": snippet.get("channelId", "N/A"),
            "Video ID": video_id,
            "Thumbnail URL": snippet.get("thumbnails", {}).get("high", {}).get("url", "N/A"),
            "Location Radius": radius,
            "Relevance Language": language,
            # Compare against None so that a coordinate of 0 is not reported as "N/A"
            "Latitude": latitude if latitude is not None else "N/A",
            "Longitude": longitude if longitude is not None else "N/A",
        
        }

        # Append details to video_data for saving to CSV
        video_data.append(details)

        # Print the extracted details
        print("\nVideo Details:")
        for key, value in details.items():
            print(f"{key}: {value}")
    
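    # Save video details to a CSV file (skip if the search returned nothing)
    if not video_data:
        print("No videos found; nothing to save.")
        return

    with open('youtube_videos.csv', 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = list(video_data[0].keys())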
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(video_data)

    print("Video details saved to youtube_videos.csv")

# Example usage: Search for videos by keyword, location, and language
# Location: San Francisco (latitude: 37.7749, longitude: -122.4194), Language: English
youtube_search("Python tutorial", max_results=50, latitude=37.7749, longitude=-122.4194, radius="50km", language="en")

Explanation of the Code:

youtube_search: This function searches for videos by keyword and passes additional parameters such as location and language. It also pulls out important data points such as the video title, channel name, video URL, description and publish date.

Here, video_data is initialized as an empty list, and a dictionary with the details of each video is appended to it. After all of the results have been collected, they are saved to the CSV file youtube_videos.csv using Python's built-in csv module.

The CSV contains one row per video, with columns such as Title, Channel Name, Video URL, etc. This lets you store the results and load them into other tools for further analysis, processing or distribution.

Getting Detailed Video Stats via YouTube API

In this part, we will collect more detailed information about each video using the YouTube Data API. This data is useful for content analysis, for example gauging how an individual video performs through its views, likes and comments.

Find the video ID: to look up the information for a video, we need that video's ID.

Each YouTube video has a unique identifier that appears at the end of its URL, after v=.

Link: https://www.youtube.com/watch?v=_uQrJ0TkZlc 

Video ID: _uQrJ0TkZlc 
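As an illustration of where the ID sits in a link, here is a minimal sketch that pulls it out with the standard library's URL parser. It is an alternative to a regular expression, and it assumes only two URL shapes: watch links with a v parameter and youtu.be short links. The function name is hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def video_id_from_url(url):
    """Return the video ID from common YouTube URL shapes, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Watch links carry the ID in the v query parameter
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None


print(video_id_from_url("https://www.youtube.com/watch?v=_uQrJ0TkZlc"))
```

Parsing the query string this way keeps extra parameters such as timestamps from leaking into the ID.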

Get the video’s data via the API

We will use the videos endpoint of the YouTube API, which provides several kinds of information about each video, such as:

  • Title and description
  • Tags used by the creator
  • Views, likes and comments
  • Date of publication, length and quality of the video

The following code retrieves this data. It includes a function that extracts the ID from a full YouTube URL, so you can pass in any video URL.

from googleapiclient.discovery import build
import re, csv

# Define your API Key here
API_KEY = 'dfdfdfdadasdasdQehkDsdsdMGgeaIs'  # Replace with your actual API key

# Build the YouTube API service
youtube = build('youtube', 'v3', developerKey=API_KEY)

# Function to extract video ID from a YouTube URL
def extract_video_id(url):
    # Regular expression to match YouTube video ID
    pattern = r"(?:v=|/)([0-9A-Za-z_-]{11}).*"
    match = re.search(pattern, url)
    if match:
        return match.group(1)
    return None

# Function to get video details
def get_video_details(url):
    video_id = extract_video_id(url)
    if not video_id:
        print("Invalid video URL")
        return

    # Call the videos.list method to retrieve video details
    request = youtube.videos().list(
        part="snippet,contentDetails,statistics",
        id=video_id
    )
    response = request.execute()
 
    # Check if the video exists
    if "items" not in response or not response["items"]:
        print("Video not found.")
        return
    
    # Extract video details
    # Parsing and displaying important video details
    video = response["items"][0]

    details = {
        "Title": video["snippet"]["title"],
        "Channel Name": video["snippet"]["channelTitle"],
        "Published At": video["snippet"]["publishedAt"],
        "Description": video["snippet"]["description"],
        "Views": video["statistics"].get("viewCount", "N/A"),
        "Likes": video["statistics"].get("likeCount", "N/A"),
        "Comments": video["statistics"].get("commentCount", "N/A"),
        "Duration": video["contentDetails"]["duration"],
        "Tags": ', '.join(video["snippet"].get("tags", [])),
        "Category ID": video["snippet"]["categoryId"],
        "Default Language": video["snippet"].get("defaultLanguage", "N/A"),
        "Dimension": video["contentDetails"]["dimension"],
        "Definition": video["contentDetails"]["definition"],
        "Captions Available": video["contentDetails"]["caption"],
        "Licensed Content": video["contentDetails"]["licensedContent"]
    }

    # Displaying the details
    print(details)
    
    # Save details to CSV
    with open('video_details.csv', 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=details.keys())
        writer.writeheader()
        writer.writerow(details)
 
    print("Video details saved to video_details.csv")

 
# Example usage
get_video_details("https://www.youtube.com/watch?v=_uQrJ0TkZlc")

Explanation of the Code:

Initialize API: The line build('youtube', 'v3', developerKey=API_KEY) initializes the YouTube API client with your API key.

Extracting the video ID: The extract_video_id() function scans a given YouTube URL and returns the video ID (a string that is unique to each video) using a regular expression. This ID is required to fetch the video's information.

Getting video information: The get_video_details() function calls the videos endpoint of the YouTube API with the video ID and gets the title, views and tags along with other information.

Write CSV: The code stores the video details in a CSV file by opening a file named video_details.csv in write mode. It then sets up a csv.DictWriter, passing the keys of details as the column names. The writeheader() method writes these headers to the file, after which writerow() writes the actual video details.
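One detail worth noting about the saved data: the Duration value from contentDetails is an ISO 8601 duration string (for example PT15M33S), which is awkward to sort or sum. The sketch below converts it to seconds. It handles only days, hours, minutes and seconds, which is an assumption about the values you will encounter, and the function name is hypothetical.

```python
import re

# Matches strings such as PT15M33S, PT1H2M3S or P1DT2H
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(duration):
    """Convert an ISO 8601 duration such as 'PT1H2M3S' to seconds."""
    match = _DURATION_RE.match(duration)
    if not match:
        raise ValueError(f"Unrecognized duration: {duration!r}")
    # Components that are missing from the string count as zero
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


print(duration_to_seconds("PT1H2M3S"))  # 3723
```

You could apply this to details["Duration"] before writing the row if you prefer a numeric column in the CSV.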

API Rate Limitations

If you use the official YouTube API, keep its rate limits in mind. YouTube applies a daily quota system, so there is a limit on how many calls your project can make per day. Here’s how it works:

  1. Every project gets 10,000 quota units per day by default.
  2. Different types of API requests use up different amounts of quota units. For example:
    • Every search request costs 100 units.
    • Video detail requests (getting a video's title, description, etc.) normally cost 1 unit.
    • Retrieving comments is charged at 2 or more units.

Going over the daily limit may lead to your requests being blocked until the quota resets. To avoid this, optimize your requests so that you retrieve only the information you need and, where possible, batch your queries.
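To make the batching advice concrete, here is a small sketch that groups video IDs into comma-separated batches (the videos endpoint accepts several IDs in one id parameter, up to 50 as far as I know) and estimates quota use from the per-request costs quoted above. The helper names and the placeholder IDs are made up for this example.

```python
def chunk_ids(video_ids, size=50):
    """Split a list of video IDs into comma-joined batches of at most `size`."""
    return [",".join(video_ids[i:i + size]) for i in range(0, len(video_ids), size)]


def estimate_quota(search_calls, detail_calls, search_cost=100, detail_cost=1):
    """Estimate quota units using the per-request costs quoted above."""
    return search_calls * search_cost + detail_calls * detail_cost


ids = [f"video{i:03d}" for i in range(120)]  # placeholder IDs, not real videos
batches = chunk_ids(ids)
print(len(batches))  # 3 batches: 50 + 50 + 20
print(estimate_quota(search_calls=10, detail_calls=len(batches)))  # 1003
```

Each batch string can be passed as the id argument of youtube.videos().list(...), so 120 videos cost three detail requests instead of 120. The estimate also shows that search requests dominate the budget.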

YouTube Scraping without the API: Selenium Stealth

When the YouTube API does not provide all the data we are looking for, we can fall back on web scraping. We will use Selenium with the selenium-stealth package to get past YouTube's bot detection. In short, this involves loading the YouTube page and reading the video information directly from the JSON-LD data embedded in the page and, if necessary, scraping related videos from a logged-in browser session.

JSON-LD Data: Extracting Video Details

import json
import time
import re
from selenium import webdriver
from selenium_stealth import stealth
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
 


# Initialize Selenium WebDriver
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")

# options.add_argument("--headless")

options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

 

driver = webdriver.Chrome(options=options)

# Function to extract video details using JSON-LD data with regex
def get_video_details(url):
    # Apply Selenium Stealth to avoid detection
    stealth(
        driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
    )

    driver.get(url)
    time.sleep(5) 

    # Extract the page source
    page_source = driver.page_source

    # Use regex to find the JSON-LD data for VideoObject
    match = re.search(r'({[^}]+"@type":"VideoObject"[^}]+})', page_source)
    if not match:
        print("No JSON-LD data found.")
        return

    # Parse JSON-LD data
    json_data = json.loads(match.group(1))

    # Extract the most important video details
    details = {
        "Title": json_data.get("name", "N/A"),
        "Description": json_data.get("description", "N/A"),
        "Duration": json_data.get("duration", "N/A"),
        "Embed URL": json_data.get("embedUrl", "N/A"),
        "Views": json_data.get("interactionCount", "N/A"),
        "Thumbnail URL": json_data.get("thumbnailUrl", ["N/A"])[0],
        "Upload Date": json_data.get("uploadDate", "N/A"),
        "Genre": json_data.get("genre", "N/A"),
        "Channel Name": json_data.get("author", "N/A"),
        "Context": json_data.get("@context", "N/A"),
        "Type": json_data.get("@type", "N/A"),
      
    }

    # Print the extracted details
    for key, value in details.items():
        print(f"{key}: {value}")

   

# Example usage
get_video_details("https://www.youtube.com/watch?v=_uQrJ0TkZlc")
 

Explanation of the Code:

Stealth: We use the selenium-stealth library to reduce the chance of bot detection.

Loading the page: The driver.get(url) command opens the page, and the script then waits 5 seconds so that the page is fully loaded.

Extracting JSON-LD data:

We use page_source to get the complete HTML of the page.

We define a regular expression that matches the JSON-LD metadata containing "@type":"VideoObject". If no match is found, an error message is printed. Otherwise, the data is parsed into a Python dictionary for further use.

Finally, we retrieve and display some information about the video (title, description, duration, views, etc.).
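The regular expression above stops at the first closing brace, so it can break when the JSON-LD contains nested objects. A more forgiving variant is sketched below: it collects every script block of type application/ld+json and lets the json module do the parsing. The function name and the sample HTML are invented for this example, and it assumes the page embeds its JSON-LD in such script tags.

```python
import json
import re

# Captures the body of every <script type="application/ld+json"> block
LD_JSON_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def find_video_object(page_source):
    """Return the first JSON-LD object whose @type is VideoObject, or None."""
    for raw in LD_JSON_RE.findall(page_source):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        # A block may hold a single object or a list of objects
        candidates = data if isinstance(data, list) else [data]
        for candidate in candidates:
            if isinstance(candidate, dict) and candidate.get("@type") == "VideoObject":
                return candidate
    return None


# Invented sample page, used only to show the function working offline.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "VideoObject",
 "name": "Sample video", "author": {"@type": "Person", "name": "Sample channel"}}
</script>
</head></html>
"""
video = find_video_object(sample_html)
print(video["name"])
```

In the scraper, you would call find_video_object(driver.page_source) and read the same keys (name, description, duration and so on) from the returned dictionary.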

Related Videos (Using Default Browser Profile)

We can scrape related videos efficiently by using a Chrome profile that is already logged in, which lets us skip an automated login. Here’s the approach:

Utilizing a browser profile: We pass Chrome's user-data-dir and profile-directory flags, pointing at a profile that is already logged in to YouTube, so no additional login steps are needed.

Passing the profile to Selenium: When Selenium loads this profile, we can reach the related videos section of YouTube by simulating clicks on the “Next” arrow and the “Related” button.

This reduces the chance of being detected as a bot, but the selectors will need to be updated whenever YouTube changes its page layout.

# Specify the Chrome user data directory up to "User Data" only
options.add_argument(r"user-data-dir=C:\Users\farhan\AppData\Local\Google\Chrome\User Data")

# Specify the profile directory (e.g., "Profile 17")
options.add_argument("profile-directory=Profile 17")

Here is the full code:

import json
import time
import re
from selenium import webdriver
from selenium_stealth import stealth
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv 


# Initialize Selenium WebDriver
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")

# options.add_argument("--headless")

options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

# Specify the Chrome user data directory up to "User Data" only
options.add_argument(r"user-data-dir=C:\Users\farhan\AppData\Local\Google\Chrome\User Data")

# Specify the profile directory (e.g., "Profile 17")
options.add_argument("profile-directory=Profile 17")

driver = webdriver.Chrome(options=options)

# Function to extract video details using JSON-LD data with regex
def get_video_details(url):
    # Apply Selenium Stealth to avoid detection
    stealth(
        driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
    )

    driver.get(url)
    time.sleep(5) 

    # Extract the page source
    page_source = driver.page_source

    # Use regex to find the JSON-LD data for VideoObject
    match = re.search(r'({[^}]+"@type":"VideoObject"[^}]+})', page_source)
    if not match:
        print("No JSON-LD data found.")
        return

    # Parse JSON-LD data
    json_data = json.loads(match.group(1))

    # Extract the most important video details
    details = {
        "Title": json_data.get("name", "N/A"),
        "Description": json_data.get("description", "N/A"),
        "Duration": json_data.get("duration", "N/A"),
        "Embed URL": json_data.get("embedUrl", "N/A"),
        "Views": json_data.get("interactionCount", "N/A"),
        "Thumbnail URL": json_data.get("thumbnailUrl", ["N/A"])[0],
        "Upload Date": json_data.get("uploadDate", "N/A"),
        "Genre": json_data.get("genre", "N/A"),
        "Channel Name": json_data.get("author", "N/A"),
        "Context": json_data.get("@context", "N/A"),
        "Type": json_data.get("@type", "N/A"),
        "Related URLs": []  # Initialize as an empty list
    }

    # Print the extracted details
    for key, value in details.items():
        print(f"{key}: {value}")

    try:
        while True:
            time.sleep(3)
            # find_elements returns an empty list (instead of raising) when the chip is absent
            related_buttons = driver.find_elements(By.XPATH, "//yt-chip-cloud-chip-renderer[.//yt-formatted-string[@title='Related']]")
            if related_buttons and related_buttons[0].is_displayed():
                related_buttons[0].click()
                print("Clicked on the 'Related' button.")
                time.sleep(3)
                # Re-read the page source so it includes the related videos that just loaded
                related_video_pattern = r'yt-simple-[^>]+video-renderer[^>]+href="([^"]+)'
                urls = re.findall(related_video_pattern, driver.page_source)
                # Add the related URLs to the list in `details`
                details["Related URLs"].extend([f"https://www.youtube.com{url}" for url in urls])
                for url in urls:
                    print(f"https://www.youtube.com{url}")
                break

            # "Related" is not visible yet: click the "Next" arrow to scroll the chip bar.
            # When no arrow is clickable, the wait times out and the except block ends the loop.
            next_arrow = WebDriverWait(driver, 2).until(
                EC.element_to_be_clickable((By.XPATH, "//div[@id='right-arrow-button']//button"))
            )
            next_arrow.click()

    except Exception as e:
        print("Could not find or click the 'Related' button:", e)
    
    # Join related URLs as a single string separated by commas
    details["Related URLs"] = ", ".join(details["Related URLs"])

    # Save details to CSV
    with open('video_details.csv', 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=details.keys())
        writer.writeheader()
        writer.writerow(details)

# Example usage
get_video_details("https://www.youtube.com/watch?v=_uQrJ0TkZlc")
 

Bypassing Bot Detection with Proxies

When you’re scraping websites like YouTube, one of the biggest challenges is avoiding detection and getting blocked. That’s where proxies come in handy. 

Proxies mask your real IP address and make it seem like you’re browsing from a completely different location. This makes it much harder for websites to tell that there is a bot behind the requests.

In most of my tutorials, I use Rayobyte proxies because they’re pretty reliable, but honestly, you can choose any proxy service that works for you. The important part is to keep things looking natural, spread your requests out, and make sure you’re not sending too many too quickly. Here, I demonstrate how you can easily integrate a proxy.

import json
import time
import re
from selenium import webdriver
from selenium_stealth import stealth
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv 


# Function to create proxy authentication extension
def create_proxy_auth_extension(proxy_host, proxy_user, proxy_pass):
    import zipfile
    import os

    # Separate the host and port
    host = proxy_host.split(':')[0]  # Extract the host part (e.g., "la.residential.rayobyte.com")
    port = proxy_host.split(':')[1]  # Extract the port part (e.g., "8000")

    # Define proxy extension files
    manifest_json = """
    {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "Chrome Proxy",
        "permissions": [
            "proxy",
            "tabs",
            "unlimitedStorage",
            "storage",
            "<all_urls>",
            "webRequest",
            "webRequestBlocking"
        ],
        "background": {
            "scripts": ["background.js"]
        },
        "minimum_chrome_version":"22.0.0"
    }
    """
    
    background_js = f"""
    var config = {{
            mode: "fixed_servers",
            rules: {{
              singleProxy: {{
                scheme: "http",
                host: "{host}",
                port: parseInt({port})
              }},
              bypassList: ["localhost"]
            }}
          }};
    chrome.proxy.settings.set({{value: config, scope: "regular"}}, function() {{}});
    chrome.webRequest.onAuthRequired.addListener(
        function(details) {{
            return {{
                authCredentials: {{
                    username: "{proxy_user}",
                    password: "{proxy_pass}"
                }}
            }};
        }},
        {{urls: ["<all_urls>"]}},
        ["blocking"]
    );
    """

    # Create the extension
    pluginfile = 'proxy_auth_plugin.zip'
    with zipfile.ZipFile(pluginfile, 'w') as zp:
        zp.writestr("manifest.json", manifest_json)
        zp.writestr("background.js", background_js)

    return pluginfile
 

# Proxy configuration
proxy_server = "server_name:port"  # Replace with your proxy server and port
proxy_username = "username"  # Replace with your proxy username
proxy_password = "password"  # Replace with your proxy password


# Initialize Selenium WebDriver
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument(f'--proxy-server={proxy_server}')
# options.add_argument("--headless")

options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

# Add proxy authentication if necessary (for proxies that require username/password)
if proxy_username and proxy_password:
        # Chrome does not support proxy authentication directly; use an extension for proxy authentication
        options.add_extension(create_proxy_auth_extension(proxy_server, proxy_username, proxy_password))

driver = webdriver.Chrome(options=options)

# Function to extract video details using JSON-LD data with regex
def get_video_details(url):
    # Apply Selenium Stealth to avoid detection
    stealth(
        driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
    )

    driver.get(url)
    time.sleep(5) 

    # Extract the page source
    page_source = driver.page_source

    # Use regex to find the JSON-LD data for VideoObject
    match = re.search(r'({[^}]+"@type":"VideoObject"[^}]+})', page_source)
    if not match:
        print("No JSON-LD data found.")
        return

    # Parse JSON-LD data
    json_data = json.loads(match.group(1))

    # Extract the most important video details
    details = {
        "Title": json_data.get("name", "N/A"),
        "Description": json_data.get("description", "N/A"),
        "Duration": json_data.get("duration", "N/A"),
        "Embed URL": json_data.get("embedUrl", "N/A"),
        "Views": json_data.get("interactionCount", "N/A"),
        "Thumbnail URL": json_data.get("thumbnailUrl", ["N/A"])[0],
        "Upload Date": json_data.get("uploadDate", "N/A"),
        "Genre": json_data.get("genre", "N/A"),
        "Channel Name": json_data.get("author", "N/A"),
        "Context": json_data.get("@context", "N/A"),
        "Type": json_data.get("@type", "N/A"),
         
    }

    # Print the extracted details
    for key, value in details.items():
        print(f"{key}: {value}")

 

    # Save details to CSV
    with open('video_details.csv', 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=details.keys())
        writer.writeheader()
        writer.writerow(details)

# Example usage
get_video_details("https://www.youtube.com/watch?v=_uQrJ0TkZlc")
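The advice about spreading requests out can be sketched as a randomized pause between page loads. The helper below is a hypothetical example: the name polite_delay and the 2 to 6 second range are assumptions for illustration, not values recommended by YouTube or any proxy provider.

```python
import random
import time


def polite_delay(min_seconds=2.0, max_seconds=6.0, rng=random, sleep=time.sleep):
    """Sleep for a random interval and return the delay that was used."""
    delay = rng.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Demonstration with a no-op sleep so the example finishes instantly.
used = polite_delay(sleep=lambda seconds: None)
print(2.0 <= used <= 6.0)  # True
```

When scraping several videos, you could call polite_delay() between successive get_video_details(url) calls so the requests do not arrive at a fixed rhythm.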
 

