Web Scraping Indeed with Python: Extract Job Listings and Salary Data

Learn How To Extract Job Listings And Salary Data by Scraping Indeed with Python:

Click here to download the source code from GitHub

Overview

Indeed can be a gold mine for job data. Whether you are looking for job descriptions or want to know the salary a company offers, this raw data is valuable to job seekers, recruiters, and others. The issue, however, is that scraping job listings is not always straightforward. Indeed and other sites do a lot to prevent bots from slurping up their data, and Captchas and IP blocks make scraping job listings difficult in general.

Challenges

Anyone who has built a web scraper has probably run into the same annoying problems. You scrape a handful of pages, and then, bam! Either a Captcha stops you in your tracks or, worse, your IP is banned because you sent too many requests. The usual workaround is a third-party Captcha-solving service, which becomes expensive and cumbersome very quickly. All of this takes the fun out of scraping job and salary data.

Botright Overview

Luckily, **Botright** is here to change the game. It’s a powerful new Python library that’s built specifically to handle these common web scraping challenges. Botright can solve Captchas on its own using AI and computer vision, which means no need for external Captcha-solving APIs. Plus, it works directly with Playwright, so if you already have code written for **web scraping job listings** or other data, Botright can be plugged right in.

You can easily install Botright with these commands:

pip install botright
playwright install

Installing BeautifulSoup: Botright handles the JavaScript-rendered content. Once a page has loaded, we use BeautifulSoup to parse the HTML and extract information such as company name, salary, and job title. It can be installed with the following command:

pip install beautifulsoup4

Botright doesn’t just look like a real browser—it actually uses a real Chromium browser on your local machine. It’s designed to avoid being detected as a bot, thanks to advanced stealth techniques. Whether you’re scraping **job listings** or other types of data, Botright has you covered. Here’s a quick example of how to get started with Botright in Playwright:

import asyncio
import botright

# Define the main asynchronous function to perform the browser actions
async def main():
    
    # Initialize Botright client to handle anti-bot measures and Captchas
    botright_client = await botright.Botright()

    # Launch a new browser session with Botright for stealth browsing
    browser = await botright_client.new_browser()

    # Open a new page (tab) in the browser
    page = await browser.new_page()

    # Navigate to the Google homepage
    await page.goto("https://google.com")

    # Close the Botright client after the tasks are done
    await botright_client.close()

# If the script is run directly, execute the main function
if __name__ == "__main__":
    asyncio.run(main())

Be patient with the setup: Botright downloads models and machine learning libraries such as PyTorch. Make sure you have at least 5 GB of free disk space for everything to work.
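If you want to check the disk space before installing, a quick sketch like the one below can help. It uses only the standard library. The helper name `has_free_space` is made up for this example, and checking the current directory is an assumption: the models may be cached on a different drive on your machine.

```python
import shutil

REQUIRED_GB = 5  # figure taken from the setup note above


def has_free_space(path=".", required_gb=REQUIRED_GB):
    """Return True if the drive holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


if __name__ == "__main__":
    if has_free_space():
        print("Enough free disk space for the Botright downloads.")
    else:
        print(f"Less than {REQUIRED_GB} GB free; clear some space before installing.")
```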

In this blog post, I will walk you through using Botright to solve Google Captchas, scrape job listings from Indeed, and save the data to a CSV file. Along the way you will see how it helps bypass some common anti-scraping protections.

Botright for Google Captcha

Objective

First, we will show how to apply Botright to one of the most irritating challenges of web scraping: Google reCAPTCHA. We will demonstrate how to solve reCAPTCHA automatically with Botright so that scraping can continue smoothly. Here, we are using the Google reCAPTCHA demo page.

Code Implementation

This is the code that we are going to use in Botright for solving Google reCAPTCHA.

import asyncio
import botright


async def main():
    botright_client = await botright.Botright()
    browser = await botright_client.new_browser()
    page = await browser.new_page()


    # Visit a page with reCAPTCHA
    await page.goto("https://www.google.com/recaptcha/api2/demo")


    # Solve the reCAPTCHA
    await page.solve_recaptcha()


    # Wait for some time to ensure reCAPTCHA is solved
    await asyncio.sleep(2)


    # Retrieve the CAPTCHA response token from the hidden input field
    captcha_response = await page.evaluate('''() => {
        const responseField = document.querySelector('textarea[name="g-recaptcha-response"]');
        return responseField ? responseField.value : null;
    }''')


    print("CAPTCHA Response Token:", captcha_response)


    # Attempt to click the reCAPTCHA submit button
    submit_button = await page.query_selector('#recaptcha-demo-submit')
    if submit_button:
        await submit_button.click()


   
   
    await botright_client.close()




if __name__ == "__main__":
    asyncio.run(main())

Code Explanation:

botright_client = await botright.Botright()

This line creates the Botright client, which handles everything from solving Captchas to making the session behave like a regular user.

botright_client.new_browser() This line tells Botright to open a Chromium-based browser that has been crafted to look and behave like a human-driven one.

browser.new_page() Next we open a new tab in that browser, just like hitting “New Tab” in Chrome.

page.goto("https://www.google.com/recaptcha/api2/demo") This one directs the browser to the Google reCAPTCHA demo page, where Botright can start solving Captchas.

The magic happens when we call await page.solve_recaptcha(). This command tells Botright to detect and solve the CAPTCHA automatically using its own built-in AI and computer vision systems. Instead of guessing, Botright behaves like a person.

Response from CAPTCHA: After solving the CAPTCHA, the code waits two seconds using asyncio.sleep(2). It then reads the CAPTCHA response token from the hidden g-recaptcha-response textarea, which Google uses to verify that the Captcha was solved.

After getting the CAPTCHA token, we click the submit button to simulate the form being submitted.
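A fixed two-second sleep can be too short on a slow connection and wasted time on a fast one. One alternative is to poll for the token until it appears. The sketch below is not part of Botright; `wait_for_value` and `TOKEN_JS` are names invented for this example, and the helper works with any async function that eventually returns a non-empty value.

```python
import asyncio

# Same JavaScript as in the script above, kept in a constant for reuse
TOKEN_JS = """() => {
    const field = document.querySelector('textarea[name="g-recaptcha-response"]');
    return field ? field.value : null;
}"""


async def wait_for_value(fetch, timeout=10.0, interval=0.5):
    """Call the async `fetch` until it returns a truthy value or `timeout` seconds pass."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        value = await fetch()
        if value:
            return value
        if loop.time() >= deadline:
            return None  # gave up: no value appeared in time
        await asyncio.sleep(interval)


# Hypothetical usage inside main(), in place of asyncio.sleep(2):
#     token = await wait_for_value(lambda: page.evaluate(TOKEN_JS))
#     if token is None:
#         print("No CAPTCHA token appeared within the timeout")
```

The helper returns `None` on timeout instead of raising, so the caller decides whether a missing token is fatal.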

Using Proxy in Botright

When you send too many requests from the same IP, websites can block you. To avoid this, we use proxies to rotate IP addresses. Changing the IP between requests makes the scraper harder to detect and less likely to be blocked. There are many proxy providers on the market; I am using Rayobyte proxies.

Here’s how you can integrate proxies with Botright:

import asyncio

import botright


async def main():
    # Open five separate sessions, each routed through the proxy
    for i in range(5):
        botright_client = await botright.Botright()

        # Replace the placeholder with your own proxy credentials
        browser = await botright_client.new_browser(proxy="username:password:server_name:port")
        page = await browser.new_page()

        # Visit an IP-lookup page to confirm which address the site sees
        await page.goto("https://www.maxmind.com/en/locate-my-ip-address")

        await botright_client.close()


if __name__ == "__main__":
    asyncio.run(main())
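The loop above passes the same proxy string on every iteration. If your provider gives you several endpoints, you can cycle through them so each new browser session uses a different one. This is a minimal sketch: the proxy strings are placeholders in the same shape as the example above, and `proxy_rotation` is a helper name made up here.

```python
import itertools

# Placeholder proxies; substitute the endpoints from your own provider
PROXIES = [
    "username:password:server_one:8000",
    "username:password:server_two:8000",
    "username:password:server_three:8000",
]


def proxy_rotation(proxies):
    """Return an iterator that yields the proxies in round-robin order, forever."""
    if not proxies:
        raise ValueError("proxy list is empty")
    return itertools.cycle(proxies)


rotation = proxy_rotation(PROXIES)

# Five sessions would use these proxies, wrapping around after the third
first_five = [next(rotation) for _ in range(5)]

# Hypothetical usage inside the loop from the script above:
#     browser = await botright_client.new_browser(proxy=next(rotation))
```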

 Scraping Job Listings from Indeed

Overview

Indeed offers a massive amount of job listings, making it a great resource for gathering job and salary data. In this section, we’ll walk through how to scrape job listing URLs from Indeed using Playwright and Botright. We’ll first focus on collecting the URLs of individual job postings and save them for further processing. Later, we’ll dive into scraping the detailed job information.

Here is the code for scraping job listings:

import csv
import asyncio
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
from loguru import logger
import botright
import warnings
from transformers import AutoTokenizer

# Suppress the FutureWarning for transformers
warnings.filterwarnings("ignore", category=FutureWarning, module="transformers")

# Example usage of a tokenizer (replace with your actual model name)
# Set clean_up_tokenization_spaces to True to avoid the warning
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', clean_up_tokenization_spaces=True)

# Define log and CSV filenames
log_filename = "playwright_log.log"
csv_filename = "scraped_job_links.csv"

# Clear the log file at the beginning
try:
    with open(log_filename, 'w'):
        pass  # Clear log file
except FileNotFoundError:
    pass

# Set up logging with loguru
logger.add(log_filename, rotation="1 week", retention="1 day", compression="zip")

# Function to read existing links from CSV
def read_existing_links():
    existing_links = set()
    try:
        with open(csv_filename, mode='r', newline='', encoding='utf-8') as csv_file:
            reader = csv.DictReader(csv_file)
            for row in reader:
                existing_links.add(row['Page Link'])
    except FileNotFoundError:
        pass  # File does not exist yet, so no links to read
    return existing_links

# Function to write job data into CSV
def write_to_csv(data, existing_links):
    try:
        with open(csv_filename, mode='a', newline='', encoding='utf-8') as csv_file:
            fieldnames = ['Job Title', 'Company Name', 'Page Link']
            writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
            # Write header if the file is empty
            if csv_file.tell() == 0:
                writer.writeheader()
            # Check for duplicates before writing
            if data['Page Link'] not in existing_links:
                writer.writerow(data)
                existing_links.add(data['Page Link'])
            else:
                logger.info(f"Duplicate found: {data['Page Link']} - Skipping entry.")
    except Exception as e:
        logger.error(f"Failed to write to CSV: {e}")

async def Job_links(pages_to_scrape):
    logger.info("Starting Playwright with botright proxy")

    # Initialize Botright asynchronously
    botright_client = await botright.Botright()

    async with async_playwright() as p:
        # Use botright to launch Playwright browser
        browser = await botright_client.new_browser()
        page = await browser.new_page()
        await page.set_viewport_size({"width": 1920, "height": 1080})
        
        try:
            logger.info("Navigating to https://www.indeed.com/jobs?q=work+from+home&l=Houston%2C+TX")
            await page.goto('https://www.indeed.com/jobs?q=work+from+home&l=Houston%2C+TX', timeout=60000)

            # Load existing links to avoid duplicates
            existing_links = read_existing_links()

            for page_number in range(pages_to_scrape):
                logger.info(f"Processing page {page_number + 1}")

                # Get the page content
                page_content = await page.content()
                
                # Parse the page content with BeautifulSoup
                soup = BeautifulSoup(page_content, 'html.parser')

                # Find all job boxes
                job_boxes = soup.find_all('div', class_='slider_container css-12igfu2 eu4oa1w0')

                for job in job_boxes:
                    job_data = {}

                    # Scrape Job Title
                    try:
                        job_title = job.find('a', class_='jcs-JobTitle').text.strip()
                        job_data['Job Title'] = job_title
                        logger.info(f"Job Title: {job_title}")
                    except Exception as e:
                        logger.error(f"Failed to scrape job title: {e}")
                        job_data['Job Title'] = "N/A"

                    # Scrape Company Name
                    try:
                        company_name = job.find('span', {'data-testid': 'company-name'}).text.strip()
                        job_data['Company Name'] = company_name
                        logger.info(f"Company Name: {company_name}")
                    except Exception as e:
                        logger.error(f"Failed to scrape company name: {e}")
                        job_data['Company Name'] = "N/A"

                    # Scrape Page Link
                    try:
                        page_link = job.find('a', class_='jcs-JobTitle')['href']
                        job_data['Page Link'] = 'https://www.indeed.com'+page_link
                        logger.info(f"Page Link: {'https://www.indeed.com'+page_link}")
                    except Exception as e:
                        logger.error(f"Failed to scrape Page Link: {e}")
                        job_data['Page Link'] = "N/A"

                  

                    # Write the job data into CSV
                    write_to_csv(job_data, existing_links)

                # Navigate to the next page using JavaScript
                if page_number < pages_to_scrape - 1:
                    try:
                        # Execute JavaScript to click on the "Next" button
                        await page.evaluate("document.querySelector('[data-testid=\"pagination-page-next\"]').click()")
                        logger.info("Clicked on the next page button using JavaScript.")
                        await page.wait_for_timeout(5000)
                    except Exception as e:
                        logger.error(f"Failed to click on the next page button using JavaScript: {e}")
                        break

        finally:
            # Close Botright even if scraping failed part-way through
            logger.info("Playwright finished, closing browser.")
            await botright_client.close()
            logger.info("Completed scraping job links.")

if __name__ == "__main__":
    try:
        pages_to_scrape = int(input("Enter the number of pages to scrape: "))
        asyncio.run(Job_links(pages_to_scrape))
    except ValueError:
        logger.error("Invalid input. Please enter a valid number.")

 Code Explanation

Importing Libraries

The script starts by importing the necessary libraries: Playwright for browser automation, BeautifulSoup4 for parsing HTML, csv for writing data, and loguru for logging. It also uses Botright to bypass anti-bot measures such as Captchas.

Setting Up CSV and Logging

The script prepares two files before we begin scraping.

CSV File: Stores the scraped job data: job title, company name, and the job page URL.

Log File: Records the scraping process, including errors and success messages.

Reading Existing Links

The read_existing_links() function loads the links already saved in the CSV file, so the same listing is not written twice.

Writing Data to CSV

The `write_to_csv()` function writes new job data to the CSV file. It appends only listings whose link has not been saved yet, so duplicate entries are skipped.

[Image: job_listing, a sample of the output CSV]

Extracting Job Information

For every results page, the rendered HTML is passed to BeautifulSoup for parsing. From each job card, the script extracts the job title, company name, and job page link.

Job Titles: The script looks for `a` tags with the class jcs-JobTitle, which hold the job title.

Company Names: It fetches the company name from the `span` tag whose data-testid attribute is company-name.

Page Links: The script pulls the job link from the same `a` tag and builds the full URL by prepending the base URL (https://www.indeed.com).
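The script builds the full URL by plain string concatenation, which works as long as every href is a relative path. A slightly more defensive option is `urljoin` from the standard library, which also leaves already-absolute links untouched. This is a sketch only: `absolute_job_url` is a name invented for this example, and the hrefs in the comments are made-up samples, not real Indeed links.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.indeed.com"


def absolute_job_url(href, base=BASE_URL):
    """Turn an href taken from a job card into an absolute URL."""
    return urljoin(base, href)


# Relative href (made-up sample): the base URL is prepended
relative_result = absolute_job_url("/rc/clk?jk=abc123")

# Already-absolute href (made-up sample): returned unchanged
absolute_result = absolute_job_url("https://www.indeed.com/viewjob?jk=abc123")
```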

Handling Pagination

After parsing all of the listings on a page, the script clicks the “Next” button (via JavaScript) to move to the next page, then waits five seconds for it to load.

Error Handling and Logging

Each extraction step is wrapped in its own try/except block, so a missing field is logged and recorded as "N/A" instead of stopping the run.
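Because every field repeats the same try/except pattern, you could fold it into one small helper. The sketch below shows the idea; `safe_extract` is not part of the original script or of any library, just a name chosen for this example.

```python
def safe_extract(extract, field, default="N/A", log=print):
    """Run `extract()` and return its result; on any error, log it and return `default`."""
    try:
        return extract()
    except Exception as exc:
        log(f"Failed to scrape {field}: {exc}")
        return default


# Hypothetical usage inside the job loop, with loguru's logger.error as the log function:
#     job_data['Job Title'] = safe_extract(
#         lambda: job.find('a', class_='jcs-JobTitle').text.strip(),
#         "job title",
#         log=logger.error,
#     )
```

Passing the extraction as a lambda delays it until the helper's try block is running, so a missing element (where `find` returns `None`) ends up as the default value rather than an unhandled exception.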

 Scraping Job Details 

Overview

Once we have the job URLs, the script visits each job page and collects the details. We scrape fields such as job title, company name, job type, salary, and profile link. The data is saved to a CSV file so that we can analyze it easily.

Here is the code for scraping job details from the individual job pages:

import csv
import asyncio
from playwright.async_api import async_playwright
from loguru import logger
import botright
import warnings
from bs4 import BeautifulSoup
from transformers import AutoTokenizer

# Suppress the FutureWarning for transformers
warnings.filterwarnings("ignore", category=FutureWarning, module="transformers")

# Example usage of a tokenizer (replace with your actual model name)
# Set clean_up_tokenization_spaces to True to avoid the warning
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', clean_up_tokenization_spaces=True)

# Define log and CSV filenames
log_filename = "playwright_log.log"
input_csv_filename = "scraped_job_links.csv"  # Path to the input CSV file with job links
output_csv_filename = "scraped_job_details.csv"  # Path to the output CSV file for scraped details

# Clear the log file at the beginning
try:
    with open(log_filename, 'w'):
        pass  # Clear log file
except FileNotFoundError:
    pass

# Set up logging with loguru
logger.add(log_filename, rotation="1 week", retention="1 day", compression="zip")

# Function to read job links from input CSV file
def read_job_links():
    job_links = []
    try:
        with open(input_csv_filename, mode='r', newline='', encoding='utf-8') as csv_file:
            reader = csv.DictReader(csv_file)
            for row in reader:
                job_links.append(row['Page Link'])
    except FileNotFoundError:
        logger.error(f"Input CSV file not found: {input_csv_filename}")
    return job_links

# Function to read existing links from the output CSV
def read_existing_links():
    existing_links = set()
    try:
        with open(output_csv_filename, mode='r', newline='', encoding='utf-8') as csv_file:
            reader = csv.DictReader(csv_file)
            for row in reader:
                existing_links.add(row['Profile Link'])
    except FileNotFoundError:
        pass  # File does not exist yet, so no links to read
    return existing_links

# Function to write job data into CSV
def write_to_csv(data, existing_links):
    try:
        with open(output_csv_filename, mode='a', newline='', encoding='utf-8') as csv_file:
            fieldnames = ['Company Name','Job Title', 'Profile Link', 'Job Type', 'Salary']
            writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
            # Write header if the file is empty
            if csv_file.tell() == 0:
                writer.writeheader()
            # Check for duplicates before writing
            if data['Profile Link'] not in existing_links:
                writer.writerow(data)
                existing_links.add(data['Profile Link'])
                logger.info(f"Appended job details to CSV: {data}")
            else:
                logger.info(f"Duplicate found: {data['Profile Link']} - Skipping entry.")
    except Exception as e:
        logger.error(f"Failed to write to CSV: {e}")

async def job_details_scraper():
    logger.info("Starting Playwright with botright proxy")

    # Read job links from input CSV
    job_links = read_job_links()

    # Load existing links to avoid duplicates
    existing_links = read_existing_links()

    # Initialize Botright asynchronously
    botright_client = await botright.Botright()

    async with async_playwright() as p:
        # Use Botright to launch Playwright browser
        browser = await botright_client.new_browser()
        page = await browser.new_page()

        # Make the browser full-screen
        await page.set_viewport_size({"width": 1920, "height": 1080})

        try:
            for job_link in job_links:
                logger.info(f"Navigating to {job_link}")
                await page.goto(job_link, timeout=60000)

                # Wait for a while to let the page fully load (adjust as needed)
                await page.wait_for_timeout(5000)  # 5 seconds delay

                # Get the page content
                page_content = await page.content()
                
                # Parse the page content with BeautifulSoup
                soup = BeautifulSoup(page_content, 'html.parser')

                job_data = {}

                try:
                    company_name = soup.select_one('.css-1ioi40n').text.strip()
                    job_data['Company Name'] = company_name
                except Exception as e:
                    logger.error(f"Failed to scrape company name: {e}")
                    job_data['Company Name'] = "N/A"  

                # Scrape job details
                try:
                    job_title = soup.select_one('.css-1b4cr5z').text.strip()
                    job_data['Job Title'] = job_title
                except Exception as e:
                    logger.error(f"Failed to scrape job title: {e}")  
                    job_data['Job Title'] = "N/A"

                try:
                    profile_link_element = soup.select_one('.css-1ioi40n')
                    if profile_link_element and profile_link_element.has_attr('href'):
                        profile_link = profile_link_element['href']
                        job_data['Profile Link'] = profile_link
                    else:
                        job_data['Profile Link'] = "N/A"
                        logger.info("No link found with the specified selector.")
                except Exception as e:
                    logger.error(f"Failed to scrape profile_link: {e}")
                    job_data['Profile Link'] = "N/A"

                try:
                    job_type = soup.select_one('.css-17cdm7w div').text.strip()
                    job_data['Job Type'] = job_type
                except Exception as e:
                    logger.error(f"Failed to scrape job type: {e}")
                    job_data['Job Type'] = "N/A"

                try:
                    salary = soup.select_one('#salaryInfoAndJobType .eu4oa1w0').text.strip()
                    job_data['Salary'] = salary
                except Exception as e:
                    logger.error(f"Failed to scrape salary: {e}")
                    job_data['Salary'] = "N/A"

                # Write the job data into CSV
                write_to_csv(job_data, existing_links)

        finally:
            # Close browser after scraping
            await browser.close()
            await botright_client.close()

if __name__ == "__main__":
    try:
        asyncio.run(job_details_scraper())
    except Exception as e:
        logger.error(f"Error in job details scraper: {e}")

Code Explanation: 

We visit each saved job link, then pull out key information such as job title, company name, salary, and job type from the job page. This data is then written to a CSV file so that it can be used later.

CSV Handling

read_job_links(): Reads the job links collected by the previous script.

read_existing_links(): Reads the links already in the output file to prevent duplicates.

write_to_csv(): Appends the new job details to the CSV.

[Image: job details, a sample of the output CSV]

For each job page, we extract:

  • Job Title: Grabs the title using a CSS selector.
  • Company Name: Pulls the company name similarly.
  • Profile Link: Captures the link attached to the company name, if available.
  • Job Type: Fetches whether the job is full-time, part-time, etc.
  • Salary: If listed, we grab the salary info.

This script pulls job details efficiently, prevents duplicates, and relies on Botright to handle CAPTCHA and IP-blocking issues. It’s a reliable way to scrape job data and store it in an organized way, ready for analysis.
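The salary lands in the CSV as free text, so it needs to be turned into numbers before any analysis. The sketch below assumes salary strings shaped like "$50,000 - $70,000 a year" or "$18.50 an hour"; that format is an assumption about what the page shows, and `parse_salary` is a helper name invented here. Strings in other shapes (for example "65K") are not handled.

```python
import re

# A dollar sign followed by digits, optional thousands separators, optional decimals
_AMOUNT = re.compile(r"\$\s*(\d[\d,]*(?:\.\d+)?)")
_PERIOD = re.compile(r"\b(hour|day|week|month|year)\b", re.IGNORECASE)


def parse_salary(text):
    """Return (low, high, period) from a salary string, or None if no dollar amount is found."""
    amounts = [float(match.replace(",", "")) for match in _AMOUNT.findall(text)]
    if not amounts:
        return None
    period_match = _PERIOD.search(text)
    period = period_match.group(1).lower() if period_match else None
    # A single amount is reported as both the low and the high end
    return min(amounts), max(amounts), period


salary_range = parse_salary("$50,000 - $70,000 a year")
hourly = parse_salary("$18.50 an hour")
missing = parse_salary("N/A")
```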

Conclusion

Recap

This guide has walked through the steps to build an Indeed job scraper. We began by testing Captcha solving with Botright, although we never actually hit a Captcha on Indeed. The scraper extracted job URLs along with additional data such as company name, salary, and job title, and saved all of it to a CSV file for analysis.

Future Considerations

You may wish to extend the scraper to cover other job platforms, or set it up to run on a schedule so your data is never stale. There are many ways to improve and scale this project.

Running the Script

If you see an `io.UnsupportedOperation: fileno` error, run the script from your terminal. On Windows, use this command:

python script_name.py
