Web Scraping Indeed with Python: Extract Job Listings and Salary Data
Learn How To Extract Job Listings And Salary Data by Scraping Indeed with Python:
Click here to download the source code from GitHub.
Table of Contents
- Overview
- Challenges
- Botright Overview
- Botright for Google Captcha
- Using Proxy in Botright
- Scraping Job Listings from Indeed
- Scraping Job Details
- Conclusion
Overview
Indeed can be a gold mine for scraping job data. Whether you are looking for job descriptions or want to know what salary a company offers, this raw data is valuable to job seekers, recruiters, and anyone else working with hiring information. The catch is that scraping job listings is not always straightforward: Indeed and similar sites work hard to keep bots from slurping up their data, so web scraping job listings is usually complicated by Captchas and IP blocks.
Challenges
If you have ever built a web scraper, you have probably hit the same annoying problems. You scrape a handful of pages and then, bam: either a Captcha stops you in your tracks, or worse, your IP gets banned for sending too many requests. The usual workaround is a third-party Captcha-solving service, which becomes expensive and cumbersome very quickly. All of this makes scraping job listings and salary data a chore.
Botright Overview
Luckily, **Botright** is here to change the game. It’s a powerful new Python library that’s built specifically to handle these common web scraping challenges. Botright can solve Captchas on its own using AI and computer vision, which means no need for external Captcha-solving APIs. Plus, it works directly with Playwright, so if you already have code written for **web scraping job listings** or other data, Botright can be plugged right in.
You can easily install Botright with these commands:
pip install botright
playwright install
Installing BeautifulSoup: Botright takes care of rendering the JavaScript-heavy pages; once a page has loaded, we will use BeautifulSoup to parse the HTML and extract information such as the company name, salary, and job title. It can be installed with the following command:
pip install beautifulsoup4
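To give a feel for that parsing step, here is a minimal sketch; the HTML is made up for illustration, but the class name and `data-testid` attribute match the selectors used later in this post:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration -- the real markup comes from the rendered Indeed page
html = (
    '<div><a class="jcs-JobTitle" href="/viewjob?jk=123">Data Analyst</a>'
    '<span data-testid="company-name">Acme Corp</span></div>'
)

soup = BeautifulSoup(html, "html.parser")
title = soup.find("a", class_="jcs-JobTitle").text.strip()
company = soup.find("span", {"data-testid": "company-name"}).text.strip()
print(title, "-", company)  # Data Analyst - Acme Corp
```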
Botright doesn’t just look like a real browser—it actually uses a real Chromium browser on your local machine. It’s designed to avoid being detected as a bot, thanks to advanced stealth techniques. Whether you’re scraping **job listings** or other types of data, Botright has you covered. Here’s a quick example of how to get started with Botright in Playwright:
```python
import asyncio
import botright

# Define the main asynchronous function to perform the browser actions
async def main():
    # Initialize Botright client to handle anti-bot measures and Captchas
    botright_client = await botright.Botright()

    # Launch a new browser session with Botright for stealth browsing
    browser = await botright_client.new_browser()

    # Open a new page (tab) in the browser
    page = await browser.new_page()

    # Navigate to the Google homepage
    await page.goto("https://google.com")

    # Close the Botright client after the tasks are done
    await botright_client.close()

# If the script is run directly, execute the main function
if __name__ == "__main__":
    asyncio.run(main())
```
Also, be patient during setup: Botright downloads several models and machine learning libraries, such as PyTorch. Make sure you have at least 5 GB of free disk space for everything to work.
In this blog post, I will walk you through using Botright to scrape job listings from Indeed, solve Google Captchas along the way, and save the data into a CSV file. You will see how Botright helps both with the scraping itself and with bypassing common anti-scraping protections.
Botright for Google Captcha
Objective
First, we will show how to apply Botright to one of the most irritating challenges in web scraping: Google reCAPTCHA. We will solve the reCAPTCHA automatically with Botright so the scraper can continue without interruption, using Google's reCAPTCHA demo page as the target.
Code Implementation
This is the code we will use with Botright to solve Google reCAPTCHA.
```python
import asyncio
import botright

async def main():
    botright_client = await botright.Botright()
    browser = await botright_client.new_browser()
    page = await browser.new_page()

    # Visit a page with reCAPTCHA
    await page.goto("https://www.google.com/recaptcha/api2/demo")

    # Solve the reCAPTCHA
    await page.solve_recaptcha()

    # Wait for some time to ensure reCAPTCHA is solved
    await asyncio.sleep(2)

    # Retrieve the CAPTCHA response token from the hidden input field
    captcha_response = await page.evaluate('''() => {
        const responseField = document.querySelector('textarea[name="g-recaptcha-response"]');
        return responseField ? responseField.value : null;
    }''')
    print("CAPTCHA Response Token:", captcha_response)

    # Attempt to click the reCAPTCHA submit button
    submit_button = await page.query_selector('#recaptcha-demo-submit')
    if submit_button:
        await submit_button.click()

    await botright_client.close()

if __name__ == "__main__":
    asyncio.run(main())
```
Code Explanation:
botright_client = await botright.Botright()
This line creates the Botright client, which handles everything from solving Captchas to making the session behave like an ordinary user.
browser = await botright_client.new_browser()
This tells Botright to open a Chromium-based browser that has been configured to look and behave like a human-driven browser.
page = await browser.new_page()
Now we open a new tab in that browser, just like clicking "New Tab" in Chrome.
await page.goto("https://www.google.com/recaptcha/api2/demo")
This directs the browser to the Google reCAPTCHA demo page, where Botright can start solving Captchas.
The magic happens when we call `await page.solve_recaptcha()`. This command tells Botright to detect and solve the Captcha automatically using its own built-in AI and computer vision models. Instead of guessing, Botright works through the challenge the way a person would.
CAPTCHA response: after solving the Captcha, the code waits two seconds using `asyncio.sleep(2)`, then grabs the Captcha response token from the hidden input field that Google uses to verify the Captcha was solved.
Once we have the Captcha token, we click the submit button to simulate the form being submitted.
Using Proxy in Botright
When you send too many requests from the same IP, websites can block you. To avoid this, we use proxies to rotate IP addresses. By changing the IP for each request, we make the scraper harder to detect and prevent blocking. There are plenty of proxy providers on the market; I am using a Rayobyte proxy.
Here’s how you can integrate proxies with Botright:
```python
import asyncio
import botright

async def main():
    for i in range(5):
        botright_client = await botright.Botright()
        # Proxy format: username:password:server_name:port
        browser = await botright_client.new_browser(proxy="username:password:server_name:port")
        page = await browser.new_page()

        # Continue by using the page
        await page.goto("https://www.maxmind.com/en/locate-my-ip-address")

        await botright_client.close()

if __name__ == "__main__":
    asyncio.run(main())
```
Scraping Job Listings from Indeed
Overview
Indeed offers a massive amount of job listings, making it a great resource for gathering job and salary data. In this section, we’ll walk through how to scrape job listing URLs from Indeed using Playwright and Botright. We’ll first focus on collecting the URLs of individual job postings and save them for further processing. Later, we’ll dive into scraping the detailed job information.
Here is the code for scraping job listings:
```python
import csv
import asyncio
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
from loguru import logger
import botright
import warnings
from transformers import AutoTokenizer

# Suppress the FutureWarning for transformers
warnings.filterwarnings("ignore", category=FutureWarning, module="transformers")

# Example usage of a tokenizer (replace with your actual model name)
# Set clean_up_tokenization_spaces to True to avoid the warning
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', clean_up_tokenization_spaces=True)

# Define log and CSV filenames
log_filename = "playwright_log.log"
csv_filename = "scraped_job_links.csv"

# Clear the log file at the beginning
try:
    with open(log_filename, 'w'):
        pass  # Clear log file
except FileNotFoundError:
    pass

# Set up logging with loguru
logger.add(log_filename, rotation="1 week", retention="1 day", compression="zip")

# Function to read existing links from CSV
def read_existing_links():
    existing_links = set()
    try:
        with open(csv_filename, mode='r', newline='', encoding='utf-8') as csv_file:
            reader = csv.DictReader(csv_file)
            for row in reader:
                existing_links.add(row['Page Link'])
    except FileNotFoundError:
        pass  # File does not exist yet, so no links to read
    return existing_links

# Function to write job data into CSV
def write_to_csv(data, existing_links):
    try:
        with open(csv_filename, mode='a', newline='', encoding='utf-8') as csv_file:
            fieldnames = ['Job Title', 'Company Name', 'Page Link']
            writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

            # Write header if the file is empty
            if csv_file.tell() == 0:
                writer.writeheader()

            # Check for duplicates before writing
            if data['Page Link'] not in existing_links:
                writer.writerow(data)
                existing_links.add(data['Page Link'])
            else:
                logger.info(f"Duplicate found: {data['Page Link']} - Skipping entry.")
    except Exception as e:
        logger.error(f"Failed to write to CSV: {e}")

async def Job_links(pages_to_scrape):
    logger.info("Starting Playwright with botright proxy")

    # Initialize Botright asynchronously
    botright_client = await botright.Botright()

    async with async_playwright() as p:
        # Use Botright to launch Playwright browser
        browser = await botright_client.new_browser()
        page = await browser.new_page()
        await page.set_viewport_size({"width": 1920, "height": 1080})

        try:
            logger.info("Navigating to https://www.indeed.com/jobs?q=work+from+home&l=Houston%2C+TX")
            await page.goto('https://www.indeed.com/jobs?q=work+from+home&l=Houston%2C+TX', timeout=60000)

            # Load existing links to avoid duplicates
            existing_links = read_existing_links()

            for page_number in range(pages_to_scrape):
                logger.info(f"Processing page {page_number + 1}")

                # Get the page content
                page_content = await page.content()

                # Parse the page content with BeautifulSoup
                soup = BeautifulSoup(page_content, 'html.parser')

                # Find all job boxes
                job_boxes = soup.find_all('div', class_='slider_container css-12igfu2 eu4oa1w0')

                for job in job_boxes:
                    job_data = {}

                    # Scrape Job Title
                    try:
                        job_title = job.find('a', class_='jcs-JobTitle').text.strip()
                        job_data['Job Title'] = job_title
                        logger.info(f"Job Title: {job_title}")
                    except Exception as e:
                        logger.error(f"Failed to scrape job title: {e}")
                        job_data['Job Title'] = "N/A"

                    # Scrape Company Name
                    try:
                        company_name = job.find('span', {'data-testid': 'company-name'}).text.strip()
                        job_data['Company Name'] = company_name
                        logger.info(f"Company Name: {company_name}")
                    except Exception as e:
                        logger.error(f"Failed to scrape company name: {e}")
                        job_data['Company Name'] = "N/A"

                    # Scrape Page Link
                    try:
                        page_link = job.find('a', class_='jcs-JobTitle')['href']
                        job_data['Page Link'] = 'https://www.indeed.com' + page_link
                        logger.info(f"Page Link: {'https://www.indeed.com' + page_link}")
                    except Exception as e:
                        logger.error(f"Failed to scrape Page Link: {e}")
                        job_data['Page Link'] = "N/A"

                    # Write the job data into CSV
                    write_to_csv(job_data, existing_links)

                # Navigate to the next page using JavaScript
                if page_number < pages_to_scrape - 1:
                    try:
                        # Execute JavaScript to click on the "Next" button
                        await page.evaluate("document.querySelector('[data-testid=\"pagination-page-next\"]').click()")
                        logger.info("Clicked on the next page button using JavaScript.")
                        await page.wait_for_timeout(5000)
                    except Exception as e:
                        logger.error(f"Failed to click on the next page button using JavaScript: {e}")
                        break

            logger.info("Playwright finished, closing browser.")
            await botright_client.close()
        finally:
            logger.info("Completed scraping job links.")

if __name__ == "__main__":
    try:
        pages_to_scrape = int(input("Enter the number of pages to scrape: "))
        asyncio.run(Job_links(pages_to_scrape))
    except ValueError:
        logger.error("Invalid input. Please enter a valid number.")
```
Code Explanation
Importing Libraries
The script starts by importing the necessary libraries: Playwright for browser automation, BeautifulSoup4 for parsing HTML, csv for writing the output, and loguru for logging. It also uses Botright to bypass anti-bot measures such as Captchas.
Setting Up CSV and Logging
The script prepares two files before scraping begins:
- CSV file: stores the scraped job data, such as the job title, company name, and the job page URL.
- Log file: records the scraping process, including errors and success messages.
Reading Existing Links
The read_existing_links() function prevents the same data from being written to the CSV file more than once.
Writing Data to CSV
The `write_to_csv()` function writes the new job data to the CSV file, adding only new job listings and skipping duplicate entries. The resulting CSV looks roughly like this:
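The columns follow the fieldnames defined in `write_to_csv()`; the values below are placeholders for illustration, not real listings.

```
Job Title,Company Name,Page Link
Customer Service Representative (example),Example Staffing,https://www.indeed.com/viewjob?jk=...
Data Entry Clerk (example),Example Health,https://www.indeed.com/viewjob?jk=...
```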
Extracting Job Information
For every page, the rendered content is handed to a BeautifulSoup object for parsing, and the script extracts job titles, company names, and job page links from the HTML:
- Job Titles: the script looks for `a` tags with the class `jcs-JobTitle`, which contain the job title.
- Company Names: it fetches the company name from the `span` tag carrying the `data-testid="company-name"` attribute.
- Page Links: it pulls the relative job link from the same `a` tag and builds the full URL by prepending the base URL (https://www.indeed.com); a small sketch of this step follows the list.
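As a minimal sketch of that last step (the markup is invented for illustration, and `urljoin` is shown as one option; the script itself simply concatenates strings):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Invented markup for illustration -- the real href comes from the live page content
snippet = '<a class="jcs-JobTitle" href="/rc/clk?jk=abc123">Remote Support Agent</a>'
soup = BeautifulSoup(snippet, "html.parser")

link_tag = soup.find("a", class_="jcs-JobTitle")
if link_tag and link_tag.has_attr("href"):
    full_url = urljoin("https://www.indeed.com", link_tag["href"])
    print(full_url)  # https://www.indeed.com/rc/clk?jk=abc123
```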
Handling Pagination
After parsing all of the listings on a page, the script clicks the "Next" button to move on to the next page of results.
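The main script triggers that click through `page.evaluate()`. A simpler alternative, sketched below under the assumption that the button keeps the same `data-testid` attribute, uses Playwright's element handle directly:

```python
# Minimal sketch: move to the next results page with Playwright's element handle.
# Assumes the "Next" button still carries the data-testid used in the main script.
async def go_to_next_page(page) -> bool:
    next_button = await page.query_selector('[data-testid="pagination-page-next"]')
    if next_button is None:
        return False  # no "Next" button, so this was the last page
    await next_button.click()
    await page.wait_for_timeout(5000)  # crude wait for the next page to render
    return True
```

Either approach works; the `evaluate()` route in the main script skips Playwright's actionability checks, while the element-handle click keeps them.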
Error Handling and Logging
Each scraping step is wrapped in error handling, so missing elements or failed fields are logged and recorded as "N/A" instead of crashing the run.
Scraping Job Details
Overview
Once we have the job URLs, the script visits each job page and collects as much detail as it can. We scrape fields such as the job title, company name, job type, salary, and profile link, and save the data to a CSV file so it can be analysed easily.
Here is the code we will use for scraping job details from each individual job page:
```python
import csv
import asyncio
from playwright.async_api import async_playwright
from loguru import logger
import botright
import warnings
from bs4 import BeautifulSoup
from transformers import AutoTokenizer

# Suppress the FutureWarning for transformers
warnings.filterwarnings("ignore", category=FutureWarning, module="transformers")

# Example usage of a tokenizer (replace with your actual model name)
# Set clean_up_tokenization_spaces to True to avoid the warning
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', clean_up_tokenization_spaces=True)

# Define log and CSV filenames
log_filename = "playwright_log.log"
input_csv_filename = "scraped_job_links.csv"     # Path to the input CSV file with job links
output_csv_filename = "scraped_job_details.csv"  # Path to the output CSV file for scraped details

# Clear the log file at the beginning
try:
    with open(log_filename, 'w'):
        pass  # Clear log file
except FileNotFoundError:
    pass

# Set up logging with loguru
logger.add(log_filename, rotation="1 week", retention="1 day", compression="zip")

# Function to read job links from input CSV file
def read_job_links():
    job_links = []
    try:
        with open(input_csv_filename, mode='r', newline='', encoding='utf-8') as csv_file:
            reader = csv.DictReader(csv_file)
            for row in reader:
                job_links.append(row['Page Link'])
    except FileNotFoundError:
        logger.error(f"Input CSV file not found: {input_csv_filename}")
    return job_links

# Function to read existing links from the output CSV
def read_existing_links():
    existing_links = set()
    try:
        with open(output_csv_filename, mode='r', newline='', encoding='utf-8') as csv_file:
            reader = csv.DictReader(csv_file)
            for row in reader:
                existing_links.add(row['Profile Link'])
    except FileNotFoundError:
        pass  # File does not exist yet, so no links to read
    return existing_links

# Function to write job data into CSV
def write_to_csv(data, existing_links):
    try:
        with open(output_csv_filename, mode='a', newline='', encoding='utf-8') as csv_file:
            fieldnames = ['Company Name', 'Job Title', 'Profile Link', 'Job Type', 'Salary']
            writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

            # Write header if the file is empty
            if csv_file.tell() == 0:
                writer.writeheader()

            # Check for duplicates before writing
            if data['Profile Link'] not in existing_links:
                writer.writerow(data)
                existing_links.add(data['Profile Link'])
                logger.info(f"Appended job details to CSV: {data}")
            else:
                logger.info(f"Duplicate found: {data['Profile Link']} - Skipping entry.")
    except Exception as e:
        logger.error(f"Failed to write to CSV: {e}")

async def job_details_scraper():
    logger.info("Starting Playwright with botright proxy")

    # Read job links from input CSV
    job_links = read_job_links()

    # Load existing links to avoid duplicates
    existing_links = read_existing_links()

    # Initialize Botright asynchronously
    botright_client = await botright.Botright()

    async with async_playwright() as p:
        # Use Botright to launch Playwright browser
        browser = await botright_client.new_browser()
        page = await browser.new_page()

        # Make the browser full-screen
        await page.set_viewport_size({"width": 1920, "height": 1080})

        try:
            for job_link in job_links:
                logger.info(f"Navigating to {job_link}")
                await page.goto(job_link, timeout=60000)

                # Wait for a while to let the page fully load (adjust as needed)
                await page.wait_for_timeout(5000)  # 5 seconds delay

                # Get the page content
                page_content = await page.content()

                # Parse the page content with BeautifulSoup
                soup = BeautifulSoup(page_content, 'html.parser')

                job_data = {}

                # Scrape company name
                try:
                    company_name = soup.select_one('.css-1ioi40n').text.strip()
                    job_data['Company Name'] = company_name
                except Exception as e:
                    logger.error(f"Failed to scrape company name: {e}")
                    job_data['Company Name'] = "N/A"

                # Scrape job title
                try:
                    job_title = soup.select_one('.css-1b4cr5z').text.strip()
                    job_data['Job Title'] = job_title
                except Exception as e:
                    logger.error(f"Failed to scrape job title: {e}")
                    job_data['Job Title'] = "N/A"

                # Scrape profile link
                try:
                    profile_link_element = soup.select_one('.css-1ioi40n')
                    if profile_link_element and profile_link_element.has_attr('href'):
                        profile_link = profile_link_element['href']
                        job_data['Profile Link'] = profile_link
                    else:
                        job_data['Profile Link'] = "N/A"
                        logger.info("No link found with the specified selector.")
                except Exception as e:
                    logger.error(f"Failed to scrape profile_link: {e}")
                    job_data['Profile Link'] = "N/A"

                # Scrape job type
                try:
                    job_type = soup.select_one('.css-17cdm7w div').text.strip()
                    job_data['Job Type'] = job_type
                except Exception as e:
                    logger.error(f"Failed to scrape job type: {e}")
                    job_data['Job Type'] = "N/A"

                # Scrape salary
                try:
                    salary = soup.select_one('#salaryInfoAndJobType .eu4oa1w0').text.strip()
                    job_data['Salary'] = salary
                except Exception as e:
                    logger.error(f"Failed to scrape salary: {e}")
                    job_data['Salary'] = "N/A"

                # Write the job data into CSV
                write_to_csv(job_data, existing_links)
        finally:
            # Close browser after scraping
            await browser.close()
            await botright_client.close()

if __name__ == "__main__":
    try:
        asyncio.run(job_details_scraper())
    except Exception as e:
        logger.error(f"Error in job details scraper: {e}")
```
Code Explanation:
We walk through the collected job links and, on each job page, pull out the key information: job title, company name, salary, and job type. This data is then written to a CSV file so it can be used later.
CSV Handling
- read_job_links(): reads the job links collected in the previous step.
- read_existing_links(): reads what is already in the output CSV to prevent duplicates.
- write_to_csv(): appends the new job details to the output CSV, skipping duplicate entries.
For each job page, we extract:
- Job Title: Grabs the title using a CSS selector.
- Company Name: Pulls the company name similarly.
- Profile Link: Captures the job’s link if available.
- Job Type: Fetches whether the job is full-time, part-time, etc.
- Salary: If listed, we grab the salary info.
This script pulls job details efficiently, prevents duplicates, and relies on Botright to handle Captcha and IP-blocking issues. It is a reliable way to scrape job data and store it in an organized form, ready for analysis.
Conclusion
Recap
This guide has walked you through building an Indeed job scraper. We began by testing Botright against Captcha challenges, although in practice Indeed never served us any. The scraper extracted job URLs along with additional data such as the company name, salary, and job title, and saved all of this information in a CSV file for analysis.
Future Considerations
You may wish to extend the scraper to cover other job platforms, or set it up to run on a schedule so your data never goes stale; a minimal scheduling sketch is shown below. There are plenty of ways to improve and scale this project.
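As one option for keeping the data fresh, the scraper could be wrapped in a small scheduler. The sketch below uses only the standard library; the module name in the commented import is hypothetical, since this post does not give the listings script a file name:

```python
import asyncio
import datetime

# Hypothetical scheduler: re-run the listings scraper once a day.
# Uncomment the import once you have saved the listings script under a module name of your choosing.
# from scrape_job_links import Job_links

async def run_daily(pages_to_scrape: int = 3, interval_hours: int = 24):
    while True:
        print(f"Starting scrape at {datetime.datetime.now().isoformat()}")
        # await Job_links(pages_to_scrape)  # call the scraper defined earlier in this post
        await asyncio.sleep(interval_hours * 3600)

if __name__ == "__main__":
    asyncio.run(run_daily())
```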
Running the Script
If you see an `io.UnsupportedOperation: fileno` error on Windows, run the script from your terminal with this command:
python script_name.py