Scrape Shopify Data with Python: A Comprehensive Shopify Scraper Tutorial

Learn how to create a Shopify scraper using Python to extract product data, prices, and more. Full tutorial with source code.

source code available here: decathlonUS_scraper

In today’s data-driven e-commerce landscape, the ability to extract and analyze product information from Shopify-based platforms can provide valuable insights for businesses. Whether you’re a business owner looking to understand market trends or a curious developer eager to explore data, scraping product information from Shopify can be incredibly useful.

In this tutorial, we’ll walk you through the process of building a simple Shopify scraper using Python. You’ll learn how to extract valuable data like product names, prices, and descriptions. By the end, you’ll have the data you need to analyze Shopify store data effectively, opening up a world of possibilities for your projects.

Scraping Decathlon: A Hands-On Example Project

To provide a practical, real-world example, we’ll focus on scraping data from the Decathlon website. Decathlon offers a great challenge for web scraping, making it the perfect case study. By tackling these obstacles, you’ll pick up skills that are not only useful for scraping Shopify but also applicable to many other websites and projects.

However, it’s important to remember that this exercise is for educational purposes only, and ethical web scraping practices should always be followed.

We’ll focus on scraping product information from the “Bags & Backpacks” category (https://www.decathlon.com/collections/backpacks-bags). Specifically, we’ll collect the following data for each product:

  1. Category
  2. Brand
  3. Product Name
  4. Star Rating
  5. Number of reviews
  6. Description
  7. Product ID
  8. Color Variation (If applicable)
  9. Size Variation (If applicable)
  10. Regular Price
  11. Previous Price (If applicable)
  12. Image URL
  13. Item URL

What We’ll Cover

  1. Setting Up a Scrapy Project: Learn how to create a new Scrapy project and configure it for our scraping task.
  2. Understanding Website Structure: Gain insights into the Decathlon website’s layout and identify the specific data points we want to extract.
  3. Writing Spider Code: Develop the spider code that will navigate through the pages and collect the required information.
  4. Handling Dynamic Content: Discover techniques to manage dynamic content and JavaScript-rendered elements that may affect our scraping.
  5. Implementing Best Practices: Understand the importance of ethical web scraping and how to follow responsible practices throughout the process.
  6. Processing and Storing Data: Learn how to process the extracted data and store it efficiently in a CSV file.

Workflow Overview

  1. Retrieve URLs for all collections within the “Bags & Backpacks” category.
  2. Extract product URLs from each category.
  3. Scrape detailed information from individual product pages.

Retrieve collection URLs → items that we want to scrape from each product page

Setting Up Scrapy Project

Let’s begin by setting up our Scrapy project:

Install Scrapy

pip install scrapy

Create a new Scrapy project

scrapy startproject decathlonUS_scraper
cd decathlonUS_scraper

Generate a new spider

scrapy genspider bag_backpacks decathlon.com

Open the project folder in your code editor (we’re using VS Code). You’ll see a file structure like this:
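For reference, a freshly generated Scrapy project typically looks like this:

decathlonUS_scraper/
├── scrapy.cfg
└── decathlonUS_scraper/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── bag_backpacks.py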

Identify the elements on the webpage

To efficiently extract data, we need to identify the relevant HTML elements. We’ll use the response.css() method to extract content using CSS selectors.
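As a quick refresher, response.css() takes a CSS selector plus the ::text or ::attr() pseudo-elements to pull out text content or attribute values. These are generic Scrapy examples, not Decathlon-specific selectors:

response.css('h1::text').get()           # text of the first matching element
response.css('a::attr(href)').getall()   # href attribute of every matching element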

To inspect HTML elements:

  1. Right-click on the webpage and select “Inspect”.
  2. Click the arrow icon in the developer tools.
  3. Hover over the elements you wish to extract.

Inspect and get the “Category URL”

Let’s start by identifying the category URLs

 

We’re going to get all the href attributes.

But notice that the same ul tag appears three times here, and the values we want are inside the third ul tag, which is nested inside the second li tag. It’s a bit confusing, so let’s test it inside the Scrapy shell first to make sure we’re selecting the correct elements.

Run the code below inside your terminal:

scrapy shell "https://www.decathlon.com/collections/lifestyle-packs"
categories = response.css('ul.de-u-listReset ul.de-u-listReset')

Check the length:

len(categories)

Check the href attributes inside each of the “categories”.
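One quick way to do that in the shell is to loop over the selectors and print the hrefs (a rough sketch; the exact URLs depend on the live page):

for i, category in enumerate(categories):
    print(i, category.css('a::attr(href)').getall())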

Our desired URLs are in the last “categories” element. Let’s incorporate this into our Scrapy code.

Open the bag_backpacks.py file

import scrapy

class BagBackpacksSpider(scrapy.Spider):
    name = "bag_backpacks"
    allowed_domains = ["decathlon.com"]
    start_urls = ["https://www.decathlon.com/collections/backpacks-bags"]
    base_url = "https://decathlon.com"


    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_url_categories)


    def parse_url_categories(self, response):
        categories = response.css('ul.de-u-listReset ul.de-u-listReset')
        category_urls = categories[2].css('a::attr(href)').getall()
        for relative_url in category_urls:
            url = self.base_url + relative_url
            yield scrapy.Request(url, callback=self.parse_product_url)


    def parse_product_url(self, response):
        print(f"Parsing product URL: {response.url}")

Inspect and get the “Product URL”

Now let’s go to (https://www.decathlon.com/collections/lifestyle-packs) to get the “Product URL” for every product that appears on the page.

We will verify using this element tag first.

Run the Scrapy shell again. Make sure to quit() the previous Scrapy shell first.

scrapy shell "https://www.decathlon.com/collections/lifestyle-packs"
product_url = response.css('a.js-de-ProductTile-link::attr(href)').getall()

Notice that this returns duplicate URLs. To remove them, we use set(), Python’s built-in data structure that automatically removes duplicates.

unique_product_urls = list(set(product_url))

Add this code block to our bag_backpacks.py file.

    def parse_product_url(self, response):
        product_urls = response.css(
            'a.js-de-ProductTile-link::attr(href)').getall()
        unique_product_urls = list(set(product_urls))
        for product_url in unique_product_urls:
            url = self.base_url + product_url
            yield scrapy.Request(url, callback=self.parse_product)

Inspect and get the items inside the product page

Category

   

breadcrumb_links = response.css('nav.breadcrumb a::text').getall()

Then join the text with ' / ':

category = ' / '.join(breadcrumb_links).strip()

Brand

   

Let’s verify this element inside the Scrapy shell again:

scrapy shell "https://www.decathlon.com/collections/lifestyle-packs/products/quechua-nh-escape-500-16-l-hiking-backpack-334520"
brand = response.css('span.de-u-textGrow2.de-u-md-textGrow3.de-u-lg-textGrow4.de-u-textBold::text').get().strip()

Product Name

   

product_name = response.css('h1::text').get().strip()

Star rating

The star rating isn’t shown as a number on the page, but if we inspect the HTML element we can see the value inside the span class="de-u-hiddenVisually" tag.

star_rating = response.css('span.de-StarRating.de-u-spaceRight06 span.de-u-hiddenVisually::text').get().strip()

We need to clean the text a bit by removing '(Average rating: ' and ' out of 5 stars,'.

We’ll use Python’s built-in string method replace().

star_rating = response.css('span.de-StarRating.de-u-spaceRight06 span.de-u-hiddenVisually::text').get().strip().replace('(Average rating: ', '').replace(' out of 5 stars,', '')

Number of reviews

reviews = response.css('span.de-u-textMedium.de-u-textSelectNone.de-u-textBlue::text').get().strip()

Description

description = response.css('ul.about-this-item li::text').getall()

Inspect and get the variation elements

If we go to the product pages, we’ll see these elements change when we click on a variation. For example, when we click on a color, the image and (sometimes) the price change as well.

Variations (If applicable)

color = response.css('span.de-u-textDarkGray.de-u-textMedium.js-de-ColorInfo::text').get().strip()

After running this code, it returns the text 'Select a color' instead of 'Yellow Ochre' as we expected.

Let’s view the page source by right-clicking and selecting 'View page source' (or pressing Ctrl+U).

It seems this part is JavaScript-rendered, since the page source shows 'Select a color' as well.
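A handy way to double-check what Scrapy actually downloads (as opposed to what the browser renders after running JavaScript) is the view() helper in the Scrapy shell, which opens the raw response in your browser:

scrapy shell "https://www.decathlon.com/collections/lifestyle-packs/products/quechua-nh-escape-500-16-l-hiking-backpack-334520"
view(response)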

Let’s scroll through the page source to see if anything about the color appears elsewhere on the page.


Notice that there is a <select> tag with the id="productSelect" that contains the color, size, regular price, and product ID.

details = response.css('select#productSelect option::text').getall()

We will use string manipulation techniques to extract the values from the result that we got above.

String Manipulation

We will extract the values for colors, sizes, product_ids and regular_prices

To test this, we’ll create a new file called string_manipulation.py and check the result. (The stray whitespace and \n characters in the third entry come from the raw option text.)

details = [
    'Yellow Ochre / 16 L / 8844302 - $39.99 USD',
    'Carbon Gray / 16 L / 8649496 - $39.99 USD',
    '\n Whale Gray / 16 L / 8649499 - Sold out\n '
]

colors = []
sizes = []
product_ids = []
prices = []

for detail in details:
    parts = detail.split(' / ')
    color = parts[0].strip()
    size = parts[1]

    infos = parts[2].split(' - ')
    product_id = infos[0]
    price = infos[1].replace('USD', '').strip()

# Check if the price is not "Sold out"
    if price != "Sold out":
        # Append the extracted values to their respective lists
        colors.append(color)
        sizes.append(size)
        product_ids.append(product_id)
        prices.append(price)
    else:
        continue  # Skip the "Sold out" case entirely

print("Colors:", colors)
print("Sizes:", sizes)
print("Product IDs:", product_ids)
print("Prices:", prices)

The result from the code above
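With the list as written above, running python string_manipulation.py should print:

Colors: ['Yellow Ochre', 'Carbon Gray']
Sizes: ['16 L', '16 L']
Product IDs: ['8844302', '8649496']
Prices: ['$39.99', '$39.99']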

Previous Price (If applicable)

   

previous_price = response.css('del.js-de-CrossedOutPrice span.js-de-PriceAmount::text').get().strip()

Image URL

From the product page, we can see that there is a big image and smaller images at the side. We’ll call them the feature image and the carousel images.

The feature image is the big one that appears on the page.

feature_img = response.css('img.de-CarouselFeature-image::attr(data-src)').getall()

The carousel images are the ones that appear on the side.

carousel_img = response.css('div.de-CarouselThumbnail-slide img::attr(data-src)').getall()

Product URL

Getting the URL is the easiest since we just need to get the URL that we are currently inspecting. Simply run:

url = response.url

Put the code together

Now that we’ve identified all the elements we need, it’s time to put all the code blocks together inside the bag_backpacks.py file.

import scrapy

class BagBackpacksSpider(scrapy.Spider):
    name = "bag_backpacks"
    allowed_domains = ["decathlon.com"]
    start_urls = ["https://www.decathlon.com/collections/backpacks-bags"]
    base_url = "https://decathlon.com"

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_url_categories)

    def parse_url_categories(self, response):
        categories = response.css('ul.de-u-listReset ul.de-u-listReset')
        category_urls = categories[2].css('a::attr(href)').getall()
        for relative_url in category_urls:
            url = self.base_url + relative_url
            yield scrapy.Request(url, callback=self.parse_product_url)

    def parse_product_url(self, response):
        product_urls = response.css(
            'a.js-de-ProductTile-link::attr(href)').getall()
        unique_product_urls = list(set(product_urls))
        for product_url in unique_product_urls:
            url = self.base_url + product_url
            yield scrapy.Request(url, callback=self.parse_product)

    def parse_product(self, response):
        breadcrumb_links = response.css('nav.breadcrumb a::text').getall()
        category = ' / '.join(breadcrumb_links).strip()
        brand = response.css(
            'span.de-u-textGrow2.de-u-md-textGrow3.de-u-lg-textGrow4.de-u-textBold::text').get().strip()
        product_name = response.css('h1::text').get().strip()
        star_rating = response.css('span.de-StarRating.de-u-spaceRight06 span.de-u-hiddenVisually::text').get(
        ).strip().replace('(Average rating: ', '').replace(' out of 5 stars,', '')
        reviews = response.css(
            'span.de-u-textMedium.de-u-textSelectNone.de-u-textBlue::text').get().strip()
        description = response.css('ul.about-this-item li::text').getall()
        all_description = '\n'.join(description)

        # Items that change depending on the variations
        details = response.css('select#productSelect option::text').getall()

        colors = []
        sizes = []
        product_ids = []
        regular_prices = []

        for detail in details:
            parts = detail.split(' / ')
            color = parts[0].strip()
            size = parts[1]

            infos = parts[2].split(' - ')
            product_id = infos[0]
            price = infos[1].replace('USD', '').strip()

        # Check if the price is not "Sold out"
            if price != "Sold out":
                # Append the extracted values to their respective lists
                colors.append(color)
                sizes.append(size)
                product_ids.append(product_id)
                regular_prices.append(price)
            else:
                continue  # Skip the "Sold out" case entirely

        prev_price = response.css(
            'del.js-de-CrossedOutPrice span.js-de-PriceAmount::text').get()
        # Not every product has a crossed-out price, so guard against None
        previous_price = prev_price.strip() if prev_price else ''
        feature_imgs = response.css(
            'img.de-CarouselFeature-image::attr(data-src)').getall()
        feature_img = [f'https:{img}' for img in feature_imgs]
        carousel_imgs = response.css(
            'div.de-CarouselThumbnail-slide img::attr(data-src)').getall()
        carousel_img = [f'https:{img}' for img in carousel_imgs]
        url = response.url

        item = {
            'Category': category,
            'Brand': brand,
            'Product Name': product_name,
            'Star Rating': star_rating,
            'Number of reviews': reviews,
            'Description': all_description,
            'Product ID': ', '.join(product_ids),
            'Color': ', '.join(colors),
            'Size': ', '.join(sizes),
            'Regular Price': ', '.join(regular_prices),
            'Previous Price': previous_price,
            'Feature Image URLs': ', '.join(feature_img),
            'Carousel Image URLs': ', '.join(carousel_img),
            'URL': url
        }

        yield item

Modify the settings.py

BOT_NAME = "decathlonUS_scraper"

SPIDER_MODULES = ["decathlonUS_scraper.spiders"]
NEWSPIDER_MODULE = "decathlonUS_scraper.spiders"

ROBOTSTXT_OBEY = True

CONCURRENT_REQUESTS = 8
DOWNLOAD_DELAY = 2

SPIDER_MIDDLEWARES = {
    "decathlonUS_scraper.middlewares.DecathlonusScraperSpiderMiddleware": 543,
}

DOWNLOADER_MIDDLEWARES = {
    "decathlonUS_scraper.middlewares.DecathlonusScraperDownloaderMiddleware": 543,
}

ITEM_PIPELINES = {
    "decathlonUS_scraper.pipelines.DecathlonusScraperPipeline": 300,
}

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

FEEDS = {
    'products_details.csv': {
        'format': 'csv',
        'overwrite': True,
    },
}

The default value of CONCURRENT_REQUESTS is 16, but we set it to 8. This means Scrapy will process at most 8 requests at the same time.

The default value of DOWNLOAD_DELAY is 0; in our code, we set it to 2. This makes Scrapy wait 2 seconds after completing a request to a domain before sending the next request to that same domain, which helps mimic human browsing behavior and reduces the risk of being flagged as a bot.
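If you’d rather let Scrapy adjust the delay on its own, the built-in AutoThrottle extension is an optional alternative (not part of this tutorial’s configuration, just a possible addition to settings.py):

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 10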

The FEEDS setting allows us to define where and how our scraped data will be stored. It’s a dictionary where the keys are the output file names (or storage URIs) and the values are dictionaries containing options that specify how the data should be formatted and handled.

In this case, we save our data in CSV format to a file named products_details.csv.
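As a side note, for one-off runs you can also skip the FEEDS setting and let Scrapy write the output file from the command line; the format is inferred from the file extension (-O overwrites the file, while -o appends to it):

scrapy crawl bag_backpacks -O products_details.csv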

Modify the items.py

import scrapy

class DecathlonusScraperItem(scrapy.Item):
    category = scrapy.Field()
    brand = scrapy.Field()
    product_name = scrapy.Field()
    star_rating = scrapy.Field()
    reviews = scrapy.Field()
    all_description = scrapy.Field()
    product_ids = scrapy.Field()
    colors = scrapy.Field()
    sizes = scrapy.Field()
    regular_prices = scrapy.Field()
    previous_price = scrapy.Field()
    feature_img = scrapy.Field()
    carousel_img = scrapy.Field()
    url = scrapy.Field()
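Note that the spider above yields a plain Python dict, so this Item class isn’t strictly required. If you prefer typed items, one option (a sketch that assumes the same local variables as in parse_product) is to import the class at the top of bag_backpacks.py and yield it instead of the dict:

from decathlonUS_scraper.items import DecathlonusScraperItem

item = DecathlonusScraperItem(
    category=category,
    brand=brand,
    product_name=product_name,
    star_rating=star_rating,
    reviews=reviews,
    all_description=all_description,
    product_ids=', '.join(product_ids),
    colors=', '.join(colors),
    sizes=', '.join(sizes),
    regular_prices=', '.join(regular_prices),
    previous_price=previous_price,
    feature_img=', '.join(feature_img),
    carousel_img=', '.join(carousel_img),
    url=response.url,
)
yield item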

Run the code

Run the spider with the following command:

scrapy crawl bag_backpacks

This code works fine the first time we run it, but keep in mind that our IP address might get blocked later on if we send too many requests.

The results

We’ll find that products_details.csv has been created inside our project directory, and it looks like this:

Setting up Proxy Rotation (Optional)

Why do we need proxy rotation? When scraping websites, especially at scale, using a single IP or proxy increases the risk of being blocked by the site. Many websites monitor traffic for unusual patterns, and multiple requests from the same IP can trigger anti-scraping measures. Proxy rotation helps to distribute requests across various IP addresses, making it harder for websites to detect and block your scraper. This not only ensures uninterrupted scraping but also keeps your scraper running smoothly without raising red flags.

Get the proxy-list

Go to https://rayobyte.com/ and click on “Start My Trial” – You’ll get a 50 MB trial and NO credit card required!

Click on “Sign Up” on Rotating Proxy Dashboard

Enter your email and password. You’ll get an account verification link in your email. Verify your account, then you’ll get access to the “Residential Dashboard”

Click on the “Proxy List Generator”. 

Inside the dashboard, you can configure the proxy location and other options.

For this tutorial, I generated just 10 IPs. Make sure to choose the format username:password@hostname:port, then click the download icon.

Save your proxy-list here:

Install Required Package

pip install scrapy-rotating-proxies

Configure the Settings

Next, we need to configure settings.py to use the rotating proxies. We need to specify the path to the .txt file containing our proxies. This file should have one proxy per line in the format http://username:password@proxy_ip:port

NOTE that the format we saved earlier is username:password@proxy_ip:port. Therefore, we need to prepend http:// to each entry using the function below.

import os

# Scrapy settings for decathlonUS_scraper project
ROTATING_PROXY_LIST_PATH = 'proxy-list.txt'

# Function to read and format the proxy list
def get_proxies():
    proxies = []
    if os.path.exists(ROTATING_PROXY_LIST_PATH):
        with open(ROTATING_PROXY_LIST_PATH, 'r') as file:
            for line in file:
                proxy = line.strip()
                if proxy:  # Ensure the line is not empty
                    # Add "http://" to each proxy
                    proxies.append(f"http://{proxy}")
    return proxies


# Set the formatted proxy list as a Scrapy setting
ROTATING_PROXY_LIST = get_proxies()

Include these middlewares for rotating proxies.

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

The RotatingProxyMiddleware and BanDetectionMiddleware are provided by the scrapy-rotating-proxies package, so we don’t need to set up custom classes for them.

Run the code

scrapy crawl bag_backpacks

You can download the source code here: decathlonUS_scraper

Conclusion

This tutorial has demonstrated how to develop a comprehensive Shopify scraper using Python and the Scrapy framework. By following this guide, you’ve created a powerful tool for extracting valuable product data from Shopify-based e-commerce platforms.

Key takeaways include:

  • Setting up a Scrapy project efficiently
  • Navigating complex website structures
  • Extracting data from dynamic content
  • Implementing ethical scraping practices
  • Processing and storing scraped data effectively

As you apply these techniques to your own projects, remember to always respect website terms of service and implement responsible scraping practices. The insights gained from this data can drive informed business decisions, market analysis, and product strategy in the competitive e-commerce landscape.
