Build a YouTube Scraper in Python to Extract Video Data
Download the full source code from GitHub
Table of Contents
- Introduction
- Getting started with the YouTube API
- Setup Google Cloud Project
- YouTube Scraper Based On Keywords
- Detailed Video Stats via YouTube API
- Get the video’s data via the API
- API Rate Limitations
- YouTube Scraping Without the API
- Selenium Stealth
- JSON-LD Data
- Related Videos
- Bypassing Bot Detection with Proxies
Introduction
YouTube offers a wealth of information about trends and audience behavior, which can support content analysis tasks such as tracking popular topics. This tutorial covers two approaches to scraping YouTube:
- Fetching data with the YouTube API: the official way to retrieve structured data directly from YouTube, including video titles, descriptions, views, likes, comments, and more.
- Scraping without the API: using Selenium Stealth to evade bot detection and scrape data directly from the website.
By the end, you will know both strategies and can choose the one that best fits your project requirements.
Getting started with the YouTube API
Now, let's get to work:
Setup Google Cloud Project
To begin using the YouTube Data API, you first have to create a project in the Google Cloud Console:
Step 1: Open the Google Cloud Console
At the top, click Select a project and then New Project.
Enter a name for your project and, if applicable, choose a location (organization or folder).
Click “Create”. After the project is created, you’ll see it in your list of available projects.
Step 2: Activate the YouTube Data API
Once you have created your project, you need to activate the YouTube Data API:
Navigate to APIs & Services > Library in the Google Cloud Console.
Search for YouTube Data API v3 and click on it.
Click Enable to add it to your project.
Go to APIs & Services > Credentials to generate API credentials.
Click Create Credentials and select API key. Copy the key, because you will need it for API calls.
Step 3: Install Required Libraries
Now that you have your API key, you will need a few Python libraries to fetch data from the API and process it.
Install the google-api-python-client library by executing:
pip install google-api-python-client
With this setup you can authenticate with the YouTube API and access video details. Now you can start building your YouTube scrapers!
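As a quick sanity check, here is a minimal sketch of building the API client and making a single test request; YOUR_API_KEY is a placeholder you must replace with the key you created above:

```python
from googleapiclient.discovery import build

# Placeholder: replace with the API key from your Google Cloud project
API_KEY = "YOUR_API_KEY"

# Build the YouTube Data API v3 client once and reuse it for all requests
youtube = build("youtube", "v3", developerKey=API_KEY)

# Request the snippet of a single public video to confirm the key works
response = youtube.videos().list(part="snippet", id="_uQrJ0TkZlc").execute()
print(response["items"][0]["snippet"]["title"])
```

If the call prints a video title, your project, API activation, and key are all working.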
Creating a YouTube Scraper Based on Keywords
Define Your Search Parameters: we can use YouTube's search endpoint to find videos by keyword. This endpoint lets us filter results with several parameters, for example:
- q: the search string.
- maxResults: the maximum number of results to return in a single request (up to 50).
- type: restricts the kind of content returned, so you can ask specifically for videos, channels, or playlists.
- relevanceLanguage: for example “en” for English.
- location and locationRadius: narrow results down to a specific geographic area.
The next step is writing the keyword-based scraping code:
```python
from googleapiclient.discovery import build
import csv

# Define your API key here
API_KEY = 'erewrwer2h62JiHrpCCMGrewrwerwerwes'  # Replace with your actual API key

# Build the YouTube API service
youtube = build('youtube', 'v3', developerKey=API_KEY)

# Define the search function with location and language
def youtube_search(keyword, max_results=5, latitude=None, longitude=None, radius="50km", language="en"):
    # Prepare the base search parameters
    search_params = {
        "part": "snippet",
        "q": keyword,
        "maxResults": max_results,
        "type": "video",
        "relevanceLanguage": language  # Specify the relevance language
    }

    # Add location parameters if latitude and longitude are provided
    if latitude is not None and longitude is not None:
        search_params["location"] = f"{latitude},{longitude}"
        search_params["locationRadius"] = radius

    # Call the search.list method to retrieve results matching the keyword, location, and language
    request = youtube.search().list(**search_params)
    response = request.execute()

    # List to store video details for CSV
    video_data = []

    # Print important video details
    for item in response.get('items', []):
        video_id = item['id']['videoId']
        snippet = item['snippet']

        # Extract the important data points
        details = {
            "Title": snippet.get("title", "N/A"),
            "Channel Name": snippet.get("channelTitle", "N/A"),
            "Video URL": f"https://www.youtube.com/watch?v={video_id}",
            "Description": snippet.get("description", "N/A"),
            "Publish Date": snippet.get("publishedAt", "N/A"),
            "Channel ID": snippet.get("channelId", "N/A"),
            "Video ID": video_id,
            "Thumbnail URL": snippet.get("thumbnails", {}).get("high", {}).get("url", "N/A"),
            "Location Radius": radius,
            "Relevance Language": language,
            "Latitude": latitude if latitude else "N/A",
            "Longitude": longitude if longitude else "N/A",
        }

        # Append details to video_data for saving to CSV
        video_data.append(details)

        # Print the extracted details
        print("\nVideo Details:")
        for key, value in details.items():
            print(f"{key}: {value}")

    # Save video details to a CSV file (only if we actually got results)
    if video_data:
        with open('youtube_videos.csv', 'w', newline='', encoding='utf-8') as csvfile:
            fieldnames = video_data[0].keys()
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(video_data)
        print("Video details saved to youtube_videos.csv")

# Example usage: search for videos by keyword, location, and language
# Location: San Francisco (latitude: 37.7749, longitude: -122.4194), Language: English
youtube_search("Python tutorial", max_results=50, latitude=37.7749, longitude=-122.4194, radius="50km", language="en")
```
Explanation of the Code:
youtube_search: this function searches for videos based on a keyword and also accepts additional parameters such as location and language. It extracts important data points like the video title, channel name, video URL, description, and publish date.
video_data is initialized as an empty list, and a dictionary with the details of each video is appended to it. Once all results are collected, they are written to a CSV file, youtube_videos.csv, using Python's built-in csv module.
The CSV contains columns such as Title, Channel Name, and Video URL, so you can load the data into your favorite analysis tools for further processing or sharing.
Getting Detailed Video Stats via YouTube API
In this part, we will collect more detailed information about each video using the YouTube Data API. This data is useful for deeper content analysis, such as comparing video duration and engagement or looking at insights for individual videos.
Find the Video ID: to fetch the information for a video, we first need its video ID.
Each YouTube video has a unique identifier that appears at the end of its URL.
Link: https://www.youtube.com/watch?v=_uQrJ0TkZlc
Video ID: _uQrJ0TkZlc
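As a quick illustration, here is a minimal sketch of pulling the ID out of a standard watch URL with Python's standard library (urllib.parse); the helper name is my own, and the full script below uses a regular expression instead, which also handles other URL shapes:

```python
from urllib.parse import urlparse, parse_qs

def video_id_from_watch_url(url):
    # Works for standard watch URLs like https://www.youtube.com/watch?v=_uQrJ0TkZlc
    query = parse_qs(urlparse(url).query)
    return query.get("v", [None])[0]

print(video_id_from_watch_url("https://www.youtube.com/watch?v=_uQrJ0TkZlc"))  # _uQrJ0TkZlc
```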
Get the video’s data via the API
Next, we will use the videos endpoint of the YouTube API, which returns a range of information about each video, such as:
- Title and description
- Tags used by the creator
- Views, likes, and comments
- Date of publication, duration, and quality of the video
The following code retrieves this data. It includes a helper function that accepts any video URL and extracts the ID from the full YouTube URL.
```python
from googleapiclient.discovery import build
import re, csv

# Define your API key here
API_KEY = 'dfdfdfdadasdasdQehkDsdsdMGgeaIs'  # Replace with your actual API key

# Build the YouTube API service
youtube = build('youtube', 'v3', developerKey=API_KEY)

# Function to extract the video ID from a YouTube URL
def extract_video_id(url):
    # Regular expression to match a YouTube video ID
    pattern = r"(?:v=|/)([0-9A-Za-z_-]{11}).*"
    match = re.search(pattern, url)
    if match:
        return match.group(1)
    return None

# Function to get video details
def get_video_details(url):
    video_id = extract_video_id(url)
    if not video_id:
        print("Invalid video URL")
        return

    # Call the videos.list method to retrieve video details
    request = youtube.videos().list(
        part="snippet,contentDetails,statistics",
        id=video_id
    )
    response = request.execute()

    # Check if the video exists
    if "items" not in response or not response["items"]:
        print("Video not found.")
        return

    # Parsing and displaying important video details
    video = response["items"][0]
    details = {
        "Title": video["snippet"]["title"],
        "Channel Name": video["snippet"]["channelTitle"],
        "Published At": video["snippet"]["publishedAt"],
        "Description": video["snippet"]["description"],
        "Views": video["statistics"].get("viewCount", "N/A"),
        "Likes": video["statistics"].get("likeCount", "N/A"),
        "Comments": video["statistics"].get("commentCount", "N/A"),
        "Duration": video["contentDetails"]["duration"],
        "Tags": ', '.join(video["snippet"].get("tags", [])),
        "Category ID": video["snippet"]["categoryId"],
        "Default Language": video["snippet"].get("defaultLanguage", "N/A"),
        "Dimension": video["contentDetails"]["dimension"],
        "Definition": video["contentDetails"]["definition"],
        "Captions Available": video["contentDetails"]["caption"],
        "Licensed Content": video["contentDetails"]["licensedContent"]
    }

    # Displaying the details
    print(details)

    # Save details to CSV
    with open('video_details.csv', 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=details.keys())
        writer.writeheader()
        writer.writerow(details)

    print("Video details saved to video_details.csv")

# Example usage
get_video_details("https://www.youtube.com/watch?v=_uQrJ0TkZlc")
```
Explanation of the Code:
Initialize the API: the build('youtube', 'v3', developerKey=API_KEY) line initializes the YouTube API client with your API key.
Extracting the Video ID: the extract_video_id() function scans the given YouTube URL with a regular expression and returns the video ID (the string that uniquely identifies each video). This ID is required to fetch the video's information.
Getting Video Information: the get_video_details() function calls the videos endpoint of the YouTube API with the video ID and retrieves the title, views, tags, and other information.
Writing the CSV: the code stores the video details in a CSV file by opening video_details.csv in write mode, creating a csv.DictWriter with the keys of details as the column names, calling writeheader() to write the header row, and then writing the actual video details with writerow().
API Rate Limitations
If you use the official YouTube API, keep its rate limits in mind. YouTube applies a daily quota system, so there is a cap on how many calls your project can make per day. Here’s how it works:
- Every project gets 10,000 quota units per day.
- Quota units are used up in different amounts for various types of API requests. For example:
  - Each search request costs around 100 units.
  - A video details request (title, description, etc.) normally costs 1 unit.
  - Retrieving comments is charged at 2 or more units.
Going over the daily limit means further requests will be rejected until the quota resets. To avoid this, optimize your requests so that you only retrieve the information you need and, where possible, batch your queries.
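When the quota runs out, the client library raises an HttpError with a 403 status. Here is a minimal sketch of catching it; the function name and retry message are my own assumptions, not part of the scripts above:

```python
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")  # placeholder key

def safe_search(keyword):
    try:
        # A single search.list call costs roughly 100 quota units
        return youtube.search().list(part="snippet", q=keyword, maxResults=50, type="video").execute()
    except HttpError as error:
        # A 403 typically means the daily quota is exhausted (or access is forbidden)
        if error.resp.status == 403:
            print("Quota exhausted or access forbidden; try again after the daily quota resets.")
            return None
        raise  # Re-raise anything unexpected
```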
YouTube Scraping Without the API: Selenium Stealth
When the YouTube API does not provide all the data we are looking for, we can fall back to web scraping. Here we will use Selenium with the Stealth plugin so it passes YouTube's bot detection. In short, the approach loads the YouTube page, reads video information directly from the JSON-LD data embedded in the page's head, and, if necessary, scrapes related videos using an already logged-in browser session.
JSON-LD Data: Extracting Video Details
```python
import json
import time
import re
from selenium import webdriver
from selenium_stealth import stealth
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Initialize Selenium WebDriver
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options)

# Function to extract video details using JSON-LD data with regex
def get_video_details(url):
    # Apply Selenium Stealth to avoid detection
    stealth(
        driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
    )

    driver.get(url)
    time.sleep(5)

    # Extract the page source
    page_source = driver.page_source

    # Use regex to find the JSON-LD data for VideoObject
    match = re.search(r'({[^}]+"@type":"VideoObject"[^}]+})', page_source)
    if not match:
        print("No JSON-LD data found.")
        return

    # Parse JSON-LD data
    json_data = json.loads(match.group(1))

    # Extract the most important video details
    details = {
        "Title": json_data.get("name", "N/A"),
        "Description": json_data.get("description", "N/A"),
        "Duration": json_data.get("duration", "N/A"),
        "Embed URL": json_data.get("embedUrl", "N/A"),
        "Views": json_data.get("interactionCount", "N/A"),
        "Thumbnail URL": json_data.get("thumbnailUrl", ["N/A"])[0],
        "Upload Date": json_data.get("uploadDate", "N/A"),
        "Genre": json_data.get("genre", "N/A"),
        "Channel Name": json_data.get("author", "N/A"),
        "Context": json_data.get("@context", "N/A"),
        "Type": json_data.get("@type", "N/A"),
    }

    # Print the extracted details
    for key, value in details.items():
        print(f"{key}: {value}")

# Example usage
get_video_details("https://www.youtube.com/watch?v=_uQrJ0TkZlc")
```
Explanation of the Code:
Stealth: we use the selenium-stealth library to bypass bot detection.
Loading the Page: the driver.get(url) call opens the page, followed by a 5-second wait so the page is fully loaded.
Extracting JSON-LD Data:
- We use driver.page_source to get the complete HTML of the page.
- We define a regular expression that matches the JSON-LD metadata of type "@type":"VideoObject".
- If no match is found, an error message is shown. Otherwise, the data is parsed into a JSON object for further use.
- Finally, we retrieve and display some information about the video (title, description, duration, views, etc.).
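The regex shortcut can break if YouTube changes how the JSON is serialized. As a more robust alternative (a sketch of my own, not part of the original script), you can locate the ld+json script tags directly and let the JSON parser do the work:

```python
import json
from selenium.webdriver.common.by import By

def get_json_ld(driver):
    # Look through every <script type="application/ld+json"> block on the loaded page
    for script in driver.find_elements(By.CSS_SELECTOR, 'script[type="application/ld+json"]'):
        try:
            data = json.loads(script.get_attribute("innerHTML"))
        except json.JSONDecodeError:
            continue
        # The data may be a single object or a list of objects
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if item.get("@type") == "VideoObject":
                return item
    return None
```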
Related Videos (Using Default Browser Profile)
We can efficiently scrape related videos by using a pre-logged-in Chrome profile to skip automated login. Here’s the approach:
Utilizing a Browser Profile: we pass the Chrome user-data-dir and profile-directory flags pointing to a profile that is already logged in to YouTube, which skips any additional login steps.
Passing the Profile to Selenium: once this profile is loaded in Selenium, we navigate to the related-videos section of YouTube by simulating clicks on the “Next” arrow and the “Related” chip.
This reduces the chances of being detected as a bot, but the selectors will need updating whenever YouTube changes its page layout.
```python
# Specify the Chrome user data directory up to "User Data" only
options.add_argument(r"user-data-dir=C:\Users\farhan\AppData\Local\Google\Chrome\User Data")
# Specify the profile directory (e.g., "Profile 17")
options.add_argument("profile-directory=Profile 17")
```
Here is the full code:
```python
import json
import time
import re
import csv
from selenium import webdriver
from selenium_stealth import stealth
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Initialize Selenium WebDriver
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

# Specify the Chrome user data directory up to "User Data" only
options.add_argument(r"user-data-dir=C:\Users\farhan\AppData\Local\Google\Chrome\User Data")
# Specify the profile directory (e.g., "Profile 17")
options.add_argument("profile-directory=Profile 17")

driver = webdriver.Chrome(options=options)

# Function to extract video details using JSON-LD data with regex
def get_video_details(url):
    # Apply Selenium Stealth to avoid detection
    stealth(
        driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
    )

    driver.get(url)
    time.sleep(5)

    # Extract the page source
    page_source = driver.page_source

    # Use regex to find the JSON-LD data for VideoObject
    match = re.search(r'({[^}]+"@type":"VideoObject"[^}]+})', page_source)
    if not match:
        print("No JSON-LD data found.")
        return

    # Parse JSON-LD data
    json_data = json.loads(match.group(1))

    # Extract the most important video details
    details = {
        "Title": json_data.get("name", "N/A"),
        "Description": json_data.get("description", "N/A"),
        "Duration": json_data.get("duration", "N/A"),
        "Embed URL": json_data.get("embedUrl", "N/A"),
        "Views": json_data.get("interactionCount", "N/A"),
        "Thumbnail URL": json_data.get("thumbnailUrl", ["N/A"])[0],
        "Upload Date": json_data.get("uploadDate", "N/A"),
        "Genre": json_data.get("genre", "N/A"),
        "Channel Name": json_data.get("author", "N/A"),
        "Context": json_data.get("@context", "N/A"),
        "Type": json_data.get("@type", "N/A"),
        "Related URLs": []  # Initialize as an empty list
    }

    # Print the extracted details
    for key, value in details.items():
        print(f"{key}: {value}")

    try:
        while True:
            time.sleep(3)
            # Click the "Next" arrow until the "Related" chip becomes visible
            next_arrow = WebDriverWait(driver, 2).until(
                EC.element_to_be_clickable((By.XPATH, "//div[@id='right-arrow-button']//button"))
            )
            next_arrow.click()
            time.sleep(3)  # Short delay to allow elements to load

            # Try to locate the "Related" chip
            related_button = driver.find_element(
                By.XPATH, "//yt-chip-cloud-chip-renderer[.//yt-formatted-string[@title='Related']]"
            )
            if related_button.is_displayed():
                related_button.click()
                print("Clicked on the 'Related' button.")
                time.sleep(3)

                # Re-read the page source after clicking so the related videos are included
                page_source = driver.page_source
                all_related_video_url = r'yt-simple-[^>]+video-renderer[^>]+href="([^"]+)'
                urls = re.findall(all_related_video_url, page_source)

                # Add the related URLs to the list in `details`
                details["Related URLs"].extend([f"https://www.youtube.com{url}" for url in urls])
                for url in urls:
                    print(f"https://www.youtube.com{url}")
                break
    except Exception as e:
        print("Could not find or click the 'Related' button:", e)

    # Join related URLs as a single string separated by commas
    details["Related URLs"] = ", ".join(details["Related URLs"])

    # Save details to CSV
    with open('video_details.csv', 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=details.keys())
        writer.writeheader()
        writer.writerow(details)

# Example usage
get_video_details("https://www.youtube.com/watch?v=_uQrJ0TkZlc")
```
Bypassing Bot Detection with Proxies
When you’re scraping websites like YouTube, one of the biggest challenges is avoiding detection and getting blocked. That’s where proxies come in handy.
Proxies mask your real IP address and make it look like you’re browsing from a completely different location. This makes it much harder for websites to tell that a bot is behind the screen.
In most of my tutorials, I use Rayobyte proxies because they’re pretty reliable, but honestly, you can choose any proxy service that works for you. The important part is to keep things looking natural, spread your requests out, and make sure you’re not sending too many too quickly. Here, I demonstrate how you can easily integrate a proxy.
```python
import json
import time
import re
import csv
from selenium import webdriver
from selenium_stealth import stealth
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Function to create a Chrome extension that handles proxy authentication
def create_proxy_auth_extension(proxy_host, proxy_user, proxy_pass):
    import zipfile

    # Separate the host and port
    host = proxy_host.split(':')[0]  # Extract the host part (e.g., "la.residential.rayobyte.com")
    port = proxy_host.split(':')[1]  # Extract the port part (e.g., "8000")

    # Define the proxy extension files
    manifest_json = """
    {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "Chrome Proxy",
        "permissions": [
            "proxy", "tabs", "unlimitedStorage", "storage",
            "<all_urls>", "webRequest", "webRequestBlocking"
        ],
        "background": { "scripts": ["background.js"] },
        "minimum_chrome_version": "22.0.0"
    }
    """

    background_js = f"""
    var config = {{
        mode: "fixed_servers",
        rules: {{
            singleProxy: {{
                scheme: "http",
                host: "{host}",
                port: parseInt({port})
            }},
            bypassList: ["localhost"]
        }}
    }};
    chrome.proxy.settings.set({{value: config, scope: "regular"}}, function() {{}});
    chrome.webRequest.onAuthRequired.addListener(
        function(details) {{
            return {{
                authCredentials: {{
                    username: "{proxy_user}",
                    password: "{proxy_pass}"
                }}
            }};
        }},
        {{urls: ["<all_urls>"]}},
        ["blocking"]
    );
    """

    # Create the extension
    pluginfile = 'proxy_auth_plugin.zip'
    with zipfile.ZipFile(pluginfile, 'w') as zp:
        zp.writestr("manifest.json", manifest_json)
        zp.writestr("background.js", background_js)
    return pluginfile

# Proxy configuration
proxy_server = "server_name:port"  # Replace with your proxy server and port
proxy_username = "username"        # Replace with your proxy username
proxy_password = "password"        # Replace with your proxy password

# Initialize Selenium WebDriver
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument(f'--proxy-server={proxy_server}')
# options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

# Add proxy authentication if necessary (for proxies that require a username/password)
if proxy_username and proxy_password:
    # Chrome does not support proxy authentication directly; use an extension instead
    options.add_extension(create_proxy_auth_extension(proxy_server, proxy_username, proxy_password))

driver = webdriver.Chrome(options=options)

# Function to extract video details using JSON-LD data with regex
def get_video_details(url):
    # Apply Selenium Stealth to avoid detection
    stealth(
        driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
    )

    driver.get(url)
    time.sleep(5)

    # Extract the page source
    page_source = driver.page_source

    # Use regex to find the JSON-LD data for VideoObject
    match = re.search(r'({[^}]+"@type":"VideoObject"[^}]+})', page_source)
    if not match:
        print("No JSON-LD data found.")
        return

    # Parse JSON-LD data
    json_data = json.loads(match.group(1))

    # Extract the most important video details
    details = {
        "Title": json_data.get("name", "N/A"),
        "Description": json_data.get("description", "N/A"),
        "Duration": json_data.get("duration", "N/A"),
        "Embed URL": json_data.get("embedUrl", "N/A"),
        "Views": json_data.get("interactionCount", "N/A"),
        "Thumbnail URL": json_data.get("thumbnailUrl", ["N/A"])[0],
        "Upload Date": json_data.get("uploadDate", "N/A"),
        "Genre": json_data.get("genre", "N/A"),
        "Channel Name": json_data.get("author", "N/A"),
        "Context": json_data.get("@context", "N/A"),
        "Type": json_data.get("@type", "N/A"),
    }

    # Print the extracted details
    for key, value in details.items():
        print(f"{key}: {value}")

    # Save details to CSV
    with open('video_details.csv', 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=details.keys())
        writer.writeheader()
        writer.writerow(details)

# Example usage
get_video_details("https://www.youtube.com/watch?v=_uQrJ0TkZlc")
```