Web Scraping Indeed with Python: Extract Job Listings and Salary Data
Learn How To Extract Job Listings And Salary Data by Scraping Indeed with Python:
Click here to download the source code from GitHub.
Table of Contents
- Overview
- Challenges
- Botright Overview
- Botright for Google Captcha
- Using Proxy in Botright
- Scraping Job Listings from Indeed
- Scraping Job Details
- Conclusion
Overview
Indeed can be a gold mine for scraping job data. Whether you are looking for job descriptions or want to know what salary a company offers, this raw data is valuable to job seekers, recruiters, and anyone else working with hiring information. The catch is that scraping job listings is not always straightforward: Indeed and similar sites work hard to keep bots from slurping up their data, so web scraping job listings is usually complicated by Captchas and IP blocks.
Challenges
If you have ever built a web scraper, you have probably hit the same annoying problems. You scrape a handful of pages and then, bam: either a Captcha stops you in your tracks, or worse, your IP gets banned for sending too many requests. The usual workaround is a third-party Captcha-solving service, which becomes expensive and cumbersome very quickly. All of this makes scraping job listings and salary data a chore.
Botright Overview
Luckily, **Botright** is here to change the game. It’s a powerful new Python library that’s built specifically to handle these common web scraping challenges. Botright can solve Captchas on its own using AI and computer vision, which means no need for external Captcha-solving APIs. Plus, it works directly with Playwright, so if you already have code written for **web scraping job listings** or other data, Botright can be plugged right in.
You can easily install Botright with these commands:
pip install botright
playwright install
Installing BeautifulSoup: Botright takes care of rendering the JavaScript-heavy pages; once a page has loaded, we will use BeautifulSoup to parse the HTML and extract information such as the company name, salary, and job title. It can be installed with the following command:
pip install beautifulsoup4
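To give a feel for that parsing step, here is a minimal sketch; the HTML is made up for illustration, but the class name and `data-testid` attribute match the selectors used later in this post:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration -- the real markup comes from the rendered Indeed page
html = (
    '<div><a class="jcs-JobTitle" href="/viewjob?jk=123">Data Analyst</a>'
    '<span data-testid="company-name">Acme Corp</span></div>'
)

soup = BeautifulSoup(html, "html.parser")
title = soup.find("a", class_="jcs-JobTitle").text.strip()
company = soup.find("span", {"data-testid": "company-name"}).text.strip()
print(title, "-", company)  # Data Analyst - Acme Corp
```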
Botright doesn’t just look like a real browser—it actually uses a real Chromium browser on your local machine. It’s designed to avoid being detected as a bot, thanks to advanced stealth techniques. Whether you’re scraping **job listings** or other types of data, Botright has you covered. Here’s a quick example of how to get started with Botright in Playwright:
```python
import asyncio
import botright

# Define the main asynchronous function to perform the browser actions
async def main():
    # Initialize Botright client to handle anti-bot measures and Captchas
    botright_client = await botright.Botright()

    # Launch a new browser session with Botright for stealth browsing
    browser = await botright_client.new_browser()

    # Open a new page (tab) in the browser
    page = await browser.new_page()

    # Navigate to the Google homepage
    await page.goto("https://google.com")

    # Close the Botright client after the tasks are done
    await botright_client.close()

# If the script is run directly, execute the main function
if __name__ == "__main__":
    asyncio.run(main())
```
Also, be patient during setup: Botright downloads several models and machine learning libraries, such as PyTorch. Make sure you have at least 5 GB of free disk space for everything to work.
In this blog post, I will walk you through using Botright to scrape job listings from Indeed, solve Google Captchas along the way, and save the data into a CSV file. You will see how Botright helps both with the scraping itself and with bypassing common anti-scraping protections.
Botright for Google Captcha
Objective
First, we will show how to apply Botright to one of the most irritating challenges in web scraping: Google reCAPTCHA. We will solve the reCAPTCHA automatically with Botright so the scraper can continue without interruption, using Google's reCAPTCHA demo page as the target.
Code Implementation
This is the code we will use with Botright to solve Google reCAPTCHA.
```python
import asyncio
import botright

async def main():
    botright_client = await botright.Botright()
    browser = await botright_client.new_browser()
    page = await browser.new_page()

    # Visit a page with reCAPTCHA
    await page.goto("https://www.google.com/recaptcha/api2/demo")

    # Solve the reCAPTCHA
    await page.solve_recaptcha()

    # Wait for some time to ensure reCAPTCHA is solved
    await asyncio.sleep(2)

    # Retrieve the CAPTCHA response token from the hidden input field
    captcha_response = await page.evaluate('''() => {
        const responseField = document.querySelector('textarea[name="g-recaptcha-response"]');
        return responseField ? responseField.value : null;
    }''')
    print("CAPTCHA Response Token:", captcha_response)

    # Attempt to click the reCAPTCHA submit button
    submit_button = await page.query_selector('#recaptcha-demo-submit')
    if submit_button:
        await submit_button.click()

    await botright_client.close()

if __name__ == "__main__":
    asyncio.run(main())
```
Code Explanation:
botright_client = await botright.Botright()
This line creates the Botright client, which handles everything from solving Captchas to making the session behave like an ordinary user.
browser = await botright_client.new_browser()
This tells Botright to open a Chromium-based browser that has been configured to look and behave like a human-driven browser.
page = await browser.new_page()
Now we open a new tab in that browser, just like clicking "New Tab" in Chrome.
await page.goto("https://www.google.com/recaptcha/api2/demo")
This directs the browser to the Google reCAPTCHA demo page, where Botright can start solving Captchas.
The magic happens when we call `await page.solve_recaptcha()`. This command tells Botright to detect and solve the Captcha automatically using its own built-in AI and computer vision models. Instead of guessing, Botright works through the challenge the way a person would.
CAPTCHA response: after solving the Captcha, the code waits two seconds using `asyncio.sleep(2)`, then grabs the Captcha response token from the hidden input field that Google uses to verify the Captcha was solved.
Once we have the Captcha token, we click the submit button to simulate the form being submitted.
Using Proxy in Botright
When you send too many requests from the same IP, websites can block you. To avoid this, we use proxies to rotate IP addresses. By changing the IP for each request, we make the scraper harder to detect and prevent blocking. There are plenty of proxy providers on the market; I am using a Rayobyte proxy.
Here’s how you can integrate proxies with Botright:
```python
import asyncio
import botright

async def main():
    for i in range(5):
        botright_client = await botright.Botright()
        # Proxy format: username:password:server_name:port
        browser = await botright_client.new_browser(proxy="username:password:server_name:port")
        page = await browser.new_page()

        # Continue by using the page
        await page.goto("https://www.maxmind.com/en/locate-my-ip-address")

        await botright_client.close()

if __name__ == "__main__":
    asyncio.run(main())
```
Scraping Job Listings from Indeed
Overview
Indeed offers a massive amount of job listings, making it a great resource for gathering job and salary data. In this section, we’ll walk through how to scrape job listing URLs from Indeed using Playwright and Botright. We’ll first focus on collecting the URLs of individual job postings and save them for further processing. Later, we’ll dive into scraping the detailed job information.
Here is the code for scraping job listings:
```python
import csv
import asyncio
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
from loguru import logger
import botright
import warnings
from transformers import AutoTokenizer

# Suppress the FutureWarning for transformers
warnings.filterwarnings("ignore", category=FutureWarning, module="transformers")

# Example usage of a tokenizer (replace with your actual model name)
# Set clean_up_tokenization_spaces to True to avoid the warning
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', clean_up_tokenization_spaces=True)

# Define log and CSV filenames
log_filename = "playwright_log.log"
csv_filename = "scraped_job_links.csv"

# Clear the log file at the beginning
try:
    with open(log_filename, 'w'):
        pass  # Clear log file
except FileNotFoundError:
    pass

# Set up logging with loguru
logger.add(log_filename, rotation="1 week", retention="1 day", compression="zip")

# Function to read existing links from CSV
def read_existing_links():
    existing_links = set()
    try:
        with open(csv_filename, mode='r', newline='', encoding='utf-8') as csv_file:
            reader = csv.DictReader(csv_file)
            for row in reader:
                existing_links.add(row['Page Link'])
    except FileNotFoundError:
        pass  # File does not exist yet, so no links to read
    return existing_links

# Function to write job data into CSV
def write_to_csv(data, existing_links):
    try:
        with open(csv_filename, mode='a', newline='', encoding='utf-8') as csv_file:
            fieldnames = ['Job Title', 'Company Name', 'Page Link']
            writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

            # Write header if the file is empty
            if csv_file.tell() == 0:
                writer.writeheader()

            # Check for duplicates before writing
            if data['Page Link'] not in existing_links:
                writer.writerow(data)
                existing_links.add(data['Page Link'])
            else:
                logger.info(f"Duplicate found: {data['Page Link']} - Skipping entry.")
    except Exception as e:
        logger.error(f"Failed to write to CSV: {e}")

async def Job_links(pages_to_scrape):
    logger.info("Starting Playwright with botright proxy")

    # Initialize Botright asynchronously
    botright_client = await botright.Botright()

    async with async_playwright() as p:
        # Use Botright to launch Playwright browser
        browser = await botright_client.new_browser()
        page = await browser.new_page()
        await page.set_viewport_size({"width": 1920, "height": 1080})

        try:
            logger.info("Navigating to https://www.indeed.com/jobs?q=work+from+home&l=Houston%2C+TX")
            await page.goto('https://www.indeed.com/jobs?q=work+from+home&l=Houston%2C+TX', timeout=60000)

            # Load existing links to avoid duplicates
            existing_links = read_existing_links()

            for page_number in range(pages_to_scrape):
                logger.info(f"Processing page {page_number + 1}")

                # Get the page content
                page_content = await page.content()

                # Parse the page content with BeautifulSoup
                soup = BeautifulSoup(page_content, 'html.parser')

                # Find all job boxes
                job_boxes = soup.find_all('div', class_='slider_container css-12igfu2 eu4oa1w0')

                for job in job_boxes:
                    job_data = {}

                    # Scrape Job Title
                    try:
                        job_title = job.find('a', class_='jcs-JobTitle').text.strip()
                        job_data['Job Title'] = job_title
                        logger.info(f"Job Title: {job_title}")
                    except Exception as e:
                        logger.error(f"Failed to scrape job title: {e}")
                        job_data['Job Title'] = "N/A"

                    # Scrape Company Name
                    try:
                        company_name = job.find('span', {'data-testid': 'company-name'}).text.strip()
                        job_data['Company Name'] = company_name
                        logger.info(f"Company Name: {company_name}")
                    except Exception as e:
                        logger.error(f"Failed to scrape company name: {e}")
                        job_data['Company Name'] = "N/A"

                    # Scrape Page Link
                    try:
                        page_link = job.find('a', class_='jcs-JobTitle')['href']
                        job_data['Page Link'] = 'https://www.indeed.com' + page_link
                        logger.info(f"Page Link: {'https://www.indeed.com' + page_link}")
                    except Exception as e:
                        logger.error(f"Failed to scrape Page Link: {e}")
                        job_data['Page Link'] = "N/A"

                    # Write the job data into CSV
                    write_to_csv(job_data, existing_links)

                # Navigate to the next page using JavaScript
                if page_number < pages_to_scrape - 1:
                    try:
                        # Execute JavaScript to click on the "Next" button
                        await page.evaluate("document.querySelector('[data-testid=\"pagination-page-next\"]').click()")
                        logger.info("Clicked on the next page button using JavaScript.")
                        await page.wait_for_timeout(5000)
                    except Exception as e:
                        logger.error(f"Failed to click on the next page button using JavaScript: {e}")
                        break

            logger.info("Playwright finished, closing browser.")
            await botright_client.close()
        finally:
            logger.info("Completed scraping job links.")

if __name__ == "__main__":
    try:
        pages_to_scrape = int(input("Enter the number of pages to scrape: "))
        asyncio.run(Job_links(pages_to_scrape))
    except ValueError:
        logger.error("Invalid input. Please enter a valid number.")
```
Code Explanation
Importing Libraries
The script starts by importing the necessary libraries: Playwright for browser automation, BeautifulSoup4 for parsing HTML, csv for writing the output, and loguru for logging. It also uses Botright to bypass anti-bot measures such as Captchas.
Setting Up CSV and Logging
The script prepares two files before scraping begins:
- CSV file: stores the scraped job data, such as the job title, company name, and the job page URL.
- Log file: records the scraping process, including errors and success messages.
Reading Existing Links
The read_existing_links() function prevents the same data from being written to the CSV file more than once.
Writing Data to CSV
The `write_to_csv()` function writes the new job data to the CSV file, adding only new job listings and skipping duplicate entries. The resulting CSV looks roughly like this:
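The columns follow the fieldnames defined in `write_to_csv()`; the values below are placeholders for illustration, not real listings.

```
Job Title,Company Name,Page Link
Customer Service Representative (example),Example Staffing,https://www.indeed.com/viewjob?jk=...
Data Entry Clerk (example),Example Health,https://www.indeed.com/viewjob?jk=...
```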
Extracting Job Information
For every page, the rendered content is handed to a BeautifulSoup object for parsing, and the script extracts job titles, company names, and job page links from the HTML:
- Job Titles: the script looks for `a` tags with the class `jcs-JobTitle`, which contain the job title.
- Company Names: it fetches the company name from the `span` tag carrying the `data-testid="company-name"` attribute.
- Page Links: it pulls the relative job link from the same `a` tag and builds the full URL by prepending the base URL (https://www.indeed.com); a small sketch of this step follows the list.
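As a minimal sketch of that last step (the markup is invented for illustration, and `urljoin` is shown as one option; the script itself simply concatenates strings):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Invented markup for illustration -- the real href comes from the live page content
snippet = '<a class="jcs-JobTitle" href="/rc/clk?jk=abc123">Remote Support Agent</a>'
soup = BeautifulSoup(snippet, "html.parser")

link_tag = soup.find("a", class_="jcs-JobTitle")
if link_tag and link_tag.has_attr("href"):
    full_url = urljoin("https://www.indeed.com", link_tag["href"])
    print(full_url)  # https://www.indeed.com/rc/clk?jk=abc123
```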
Handling Pagination
After parsing all of the listings on a page, the script clicks the "Next" button to move on to the next page of results.
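The main script triggers that click through `page.evaluate()`. A simpler alternative, sketched below under the assumption that the button keeps the same `data-testid` attribute, uses Playwright's element handle directly:

```python
# Minimal sketch: move to the next results page with Playwright's element handle.
# Assumes the "Next" button still carries the data-testid used in the main script.
async def go_to_next_page(page) -> bool:
    next_button = await page.query_selector('[data-testid="pagination-page-next"]')
    if next_button is None:
        return False  # no "Next" button, so this was the last page
    await next_button.click()
    await page.wait_for_timeout(5000)  # crude wait for the next page to render
    return True
```

Either approach works; the `evaluate()` route in the main script skips Playwright's actionability checks, while the element-handle click keeps them.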
Error Handling and Logging
Each scraping step is wrapped in error handling, so missing elements or failed fields are logged and recorded as "N/A" instead of crashing the run.
Scraping Job Details
Overview
Once we have the job URLs, the script visits each job page and collects as much detail as it can. We scrape fields such as the job title, company name, job type, salary, and profile link, and save the data to a CSV file so it can be analysed easily.
Here is the code we will use for scraping job details from each individual job page:
```python
import csv
import asyncio
from playwright.async_api import async_playwright
from loguru import logger
import botright
import warnings
from bs4 import BeautifulSoup
from transformers import AutoTokenizer

# Suppress the FutureWarning for transformers
warnings.filterwarnings("ignore", category=FutureWarning, module="transformers")

# Example usage of a tokenizer (replace with your actual model name)
# Set clean_up_tokenization_spaces to True to avoid the warning
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', clean_up_tokenization_spaces=True)

# Define log and CSV filenames
log_filename = "playwright_log.log"
input_csv_filename = "scraped_job_links.csv"     # Path to the input CSV file with job links
output_csv_filename = "scraped_job_details.csv"  # Path to the output CSV file for scraped details

# Clear the log file at the beginning
try:
    with open(log_filename, 'w'):
        pass  # Clear log file
except FileNotFoundError:
    pass

# Set up logging with loguru
logger.add(log_filename, rotation="1 week", retention="1 day", compression="zip")

# Function to read job links from input CSV file
def read_job_links():
    job_links = []
    try:
        with open(input_csv_filename, mode='r', newline='', encoding='utf-8') as csv_file:
            reader = csv.DictReader(csv_file)
            for row in reader:
                job_links.append(row['Page Link'])
    except FileNotFoundError:
        logger.error(f"Input CSV file not found: {input_csv_filename}")
    return job_links

# Function to read existing links from the output CSV
def read_existing_links():
    existing_links = set()
    try:
        with open(output_csv_filename, mode='r', newline='', encoding='utf-8') as csv_file:
            reader = csv.DictReader(csv_file)
            for row in reader:
                existing_links.add(row['Profile Link'])
    except FileNotFoundError:
        pass  # File does not exist yet, so no links to read
    return existing_links

# Function to write job data into CSV
def write_to_csv(data, existing_links):
    try:
        with open(output_csv_filename, mode='a', newline='', encoding='utf-8') as csv_file:
            fieldnames = ['Company Name', 'Job Title', 'Profile Link', 'Job Type', 'Salary']
            writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

            # Write header if the file is empty
            if csv_file.tell() == 0:
                writer.writeheader()

            # Check for duplicates before writing
            if data['Profile Link'] not in existing_links:
                writer.writerow(data)
                existing_links.add(data['Profile Link'])
                logger.info(f"Appended job details to CSV: {data}")
            else:
                logger.info(f"Duplicate found: {data['Profile Link']} - Skipping entry.")
    except Exception as e:
        logger.error(f"Failed to write to CSV: {e}")

async def job_details_scraper():
    logger.info("Starting Playwright with botright proxy")

    # Read job links from input CSV
    job_links = read_job_links()

    # Load existing links to avoid duplicates
    existing_links = read_existing_links()

    # Initialize Botright asynchronously
    botright_client = await botright.Botright()

    async with async_playwright() as p:
        # Use Botright to launch Playwright browser
        browser = await botright_client.new_browser()
        page = await browser.new_page()

        # Make the browser full-screen
        await page.set_viewport_size({"width": 1920, "height": 1080})

        try:
            for job_link in job_links:
                logger.info(f"Navigating to {job_link}")
                await page.goto(job_link, timeout=60000)

                # Wait for a while to let the page fully load (adjust as needed)
                await page.wait_for_timeout(5000)  # 5 seconds delay

                # Get the page content
                page_content = await page.content()

                # Parse the page content with BeautifulSoup
                soup = BeautifulSoup(page_content, 'html.parser')

                job_data = {}

                # Scrape company name
                try:
                    company_name = soup.select_one('.css-1ioi40n').text.strip()
                    job_data['Company Name'] = company_name
                except Exception as e:
                    logger.error(f"Failed to scrape company name: {e}")
                    job_data['Company Name'] = "N/A"

                # Scrape job title
                try:
                    job_title = soup.select_one('.css-1b4cr5z').text.strip()
                    job_data['Job Title'] = job_title
                except Exception as e:
                    logger.error(f"Failed to scrape job title: {e}")
                    job_data['Job Title'] = "N/A"

                # Scrape profile link
                try:
                    profile_link_element = soup.select_one('.css-1ioi40n')
                    if profile_link_element and profile_link_element.has_attr('href'):
                        profile_link = profile_link_element['href']
                        job_data['Profile Link'] = profile_link
                    else:
                        job_data['Profile Link'] = "N/A"
                        logger.info("No link found with the specified selector.")
                except Exception as e:
                    logger.error(f"Failed to scrape profile_link: {e}")
                    job_data['Profile Link'] = "N/A"

                # Scrape job type
                try:
                    job_type = soup.select_one('.css-17cdm7w div').text.strip()
                    job_data['Job Type'] = job_type
                except Exception as e:
                    logger.error(f"Failed to scrape job type: {e}")
                    job_data['Job Type'] = "N/A"

                # Scrape salary
                try:
                    salary = soup.select_one('#salaryInfoAndJobType .eu4oa1w0').text.strip()
                    job_data['Salary'] = salary
                except Exception as e:
                    logger.error(f"Failed to scrape salary: {e}")
                    job_data['Salary'] = "N/A"

                # Write the job data into CSV
                write_to_csv(job_data, existing_links)
        finally:
            # Close browser after scraping
            await browser.close()
            await botright_client.close()

if __name__ == "__main__":
    try:
        asyncio.run(job_details_scraper())
    except Exception as e:
        logger.error(f"Error in job details scraper: {e}")
```
Code Explanation:
We walk through the collected job links and, on each job page, pull out the key information: job title, company name, salary, and job type. This data is then written to a CSV file so it can be used later.
CSV Handling
- read_job_links(): reads the job links collected in the previous step.
- read_existing_links(): reads what is already in the output CSV to prevent duplicates.
- write_to_csv(): appends the new job details to the output CSV, skipping duplicate entries.
For each job page, we extract:
- Job Title: Grabs the title using a CSS selector.
- Company Name: Pulls the company name similarly.
- Profile Link: Captures the job’s link if available.
- Job Type: Fetches whether the job is full-time, part-time, etc.
- Salary: If listed, we grab the salary info.
This script pulls job details efficiently, prevents duplicates, and relies on Botright to handle Captcha and IP-blocking issues. It is a reliable way to scrape job data and store it in an organized form, ready for analysis.
Conclusion
Recap
This guide has walked you through building an Indeed job scraper. We began by testing Botright against Captcha challenges, although in practice Indeed never served us any. The scraper extracted job URLs along with additional data such as the company name, salary, and job title, and saved all of this information in a CSV file for analysis.
Future Considerations
You may wish to extend the scraper to cover other job platforms, or set it up to run on a schedule so your data never goes stale; a minimal scheduling sketch is shown below. There are plenty of ways to improve and scale this project.
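As one option for keeping the data fresh, the scraper could be wrapped in a small scheduler. The sketch below uses only the standard library; the module name in the commented import is hypothetical, since this post does not give the listings script a file name:

```python
import asyncio
import datetime

# Hypothetical scheduler: re-run the listings scraper once a day.
# Uncomment the import once you have saved the listings script under a module name of your choosing.
# from scrape_job_links import Job_links

async def run_daily(pages_to_scrape: int = 3, interval_hours: int = 24):
    while True:
        print(f"Starting scrape at {datetime.datetime.now().isoformat()}")
        # await Job_links(pages_to_scrape)  # call the scraper defined earlier in this post
        await asyncio.sleep(interval_hours * 3600)

if __name__ == "__main__":
    asyncio.run(run_daily())
```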
Running the Script
If you see an `io.UnsupportedOperation: fileno` error on Windows, run the script from your terminal with this command:
python script_name.py