{"id":905,"date":"2024-09-13T14:48:41","date_gmt":"2024-09-13T14:48:41","guid":{"rendered":"https:\/\/rayobyte.com\/community\/?post_type=scraping_project&#038;p=905"},"modified":"2024-10-09T14:53:52","modified_gmt":"2024-10-09T14:53:52","slug":"scrape-shopify-data-with-python-a-comprehensive-shopify-scraper-tutorial","status":"publish","type":"scraping_project","link":"https:\/\/rayobyte.com\/community\/scraping-project\/scrape-shopify-data-with-python-a-comprehensive-shopify-scraper-tutorial\/","title":{"rendered":"Scrape Shopify Data with Python: A Comprehensive Shopify Scraper Tutorial"},"content":{"rendered":"<p style=\"text-align: center;\"><iframe loading=\"lazy\" title=\"YouTube video player\" src=\"https:\/\/www.youtube.com\/embed\/pAl0SYkeHd8?si=BEko8i-rIMXHOK_h\" width=\"560\" height=\"315\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"><\/iframe><br \/>\n<i><span style=\"font-weight: 400;\">Learn how to create a Shopify scraper using Python to extract product data, prices, and more. 
Full tutorial with source code.<\/span><\/i><\/p>\n<p><i><span style=\"font-weight: 400;\">Source code available here: <a href=\"https:\/\/github.com\/ainacodes\/decathlonUS_scraper\" rel=\"nofollow noopener\" target=\"_blank\">decathlonUS_scraper<\/a><\/span><\/i><\/p>\n<p>Video tutorial: Shopify <a href=\"https:\/\/youtu.be\/pAl0SYkeHd8?si=nEgkZD7N8r0ZfG06\" rel=\"nofollow noopener\" target=\"_blank\">Data Scraping with Python &amp; Scrapy | Complete Shopify Scraper Guide<\/a><\/p>\n<h2>Table of Contents<\/h2>\n<p><a href=\"#introduction\">Introduction<\/a><br \/>\n<a href=\"#example-project\">Scraping Decathlon: A Hands-On Example Project<\/a><br \/>\n<a href=\"#cover\">What We&#8217;ll Cover<\/a><br \/>\n<a href=\"#workflow\">Workflow Overview<\/a><br \/>\n<a href=\"#setup-scrapy\">Setting Up Scrapy Project<\/a><br \/>\n<a href=\"#identify-element\">Identify the element from the webpage<\/a><br \/>\n<a href=\"#category-url\">Inspect and get the &#8220;Category URL&#8221;<\/a><br \/>\n<a href=\"#product-url\">Inspect and get the &#8220;Product URL&#8221;<\/a><br \/>\n<a href=\"#product-page\">Inspect and get the items inside the product page<\/a><br \/>\n<a href=\"#variable-elements\">Inspect and get the variable elements<\/a><br \/>\n<a href=\"#complete-code\">Put the code together<\/a><br \/>\n<a href=\"#setup-proxy\">Setting up Proxy Rotation (Optional)<\/a><br \/>\n<a href=\"#conclusion\">Conclusion<\/a><\/p>\n<h2 id=\"introduction\">Introduction<\/h2>\n<p><span style=\"font-weight: 400;\">In today&#8217;s data-driven e-commerce landscape, the ability to extract and analyze product information from Shopify-based platforms can provide valuable insights for businesses. 
Whether you&#8217;re a business owner looking to understand market trends or a curious developer eager to explore data, scraping product information from Shopify can be incredibly useful.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In this tutorial, we&#8217;ll walk you through the process of building a simple Shopify scraper using Python. You&#8217;ll learn how to extract valuable data like product names, prices, and descriptions. By the end, you&#8217;ll have what you need to analyze Shopify store data effectively, opening up a world of possibilities for your projects.<\/span><\/p>\n<h2 id=\"example-project\"><span style=\"font-weight: 400;\">Scraping Decathlon: A Hands-On Example Project<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">To provide a practical, real-world example, we&#8217;ll focus on scraping data from the Decathlon website. Decathlon poses some real challenges for web scraping, which makes it a good case study. By tackling these challenges, you&#8217;ll pick up skills that are not only useful for scraping Shopify but also applicable to many other websites and projects.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">However, it&#8217;s important to remember that this exercise is for educational purposes only, and ethical web scraping practices should always be followed.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We&#8217;ll focus on scraping product information from the <strong>&#8220;Bags &amp; Backpacks&#8221;<\/strong> category (<a href=\"https:\/\/www.decathlon.com\/collections\/backpacks-bags\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/www.decathlon.com\/collections\/backpacks-bags<\/a>). 
Specifically, we&#8217;ll collect the following data for each product:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Category<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Brand<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Product Name<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Star Rating<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Number of reviews<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Description<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Product ID<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Color Variation (If applicable)<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Size Variation (If applicable)<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Regular Price<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Previous Price (If applicable)<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Image URL<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Item URL<\/span><\/li>\n<\/ol>\n<h2 id=\"cover\"><span style=\"font-weight: 400;\">What We&#8217;ll Cover<\/span><\/h2>\n<ol>\n<li style=\"font-weight: 400;\"><b>Setting Up a Scrapy Project:<\/b><span style=\"font-weight: 400;\"> Learn how to create a new Scrapy project and configure it for our scraping task.<\/span><\/li>\n<li style=\"font-weight: 400;\"><b>Understanding Website Structure:<\/b><span style=\"font-weight: 400;\"> Gain insights into the Decathlon website&#8217;s layout and identify the specific data points we want to extract.<\/span><\/li>\n<li style=\"font-weight: 400;\"><b>Writing Spider Code:<\/b><span style=\"font-weight: 400;\"> Develop the 
spider code that will navigate through the pages and collect the required information.<\/span><\/li>\n<li style=\"font-weight: 400;\"><b>Handling Dynamic Content:<\/b><span style=\"font-weight: 400;\"> Discover techniques to manage dynamic content and JavaScript-rendered elements that may affect our scraping.<\/span><\/li>\n<li style=\"font-weight: 400;\"><b>Implementing Best Practices: <\/b><span style=\"font-weight: 400;\">Understand the importance of ethical web scraping and how to follow responsible practices throughout the process.<\/span><\/li>\n<li style=\"font-weight: 400;\"><b>Processing and Storing Data:<\/b><span style=\"font-weight: 400;\"> Learn how to process the extracted data and store it efficiently in a CSV file.<\/span><\/li>\n<\/ol>\n<h2 id=\"workflow\"><span style=\"font-weight: 400;\">Workflow Overview<\/span><\/h2>\n<ol>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Retrieve URLs for all collections within the &#8220;Bags &amp; Backpacks&#8221; category.<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Extract product URLs from each category.<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Scrape detailed information from individual product pages.<\/span><\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-906 size-large\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/retrieve_collection_ulrs-1024x574.png\" alt=\"retrieve collection urls\" width=\"1024\" height=\"574\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/retrieve_collection_ulrs-1024x574.png 1024w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/retrieve_collection_ulrs-300x168.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/retrieve_collection_ulrs-768x431.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/retrieve_collection_ulrs-624x350.png 624w, 
https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/retrieve_collection_ulrs.png 1211w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/> Retrieve collection URLs<img loading=\"lazy\" decoding=\"async\" class=\"wp-image-908 size-large\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/all_items-1024x630.png\" alt=\"Items that we want to scrape\" width=\"1024\" height=\"630\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/all_items-1024x630.png 1024w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/all_items-300x185.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/all_items-768x473.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/all_items-624x384.png 624w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/all_items.png 1254w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/> Items that we want to scrape<\/p>\n<h2 id=\"setup-scrapy\">Setting Up Scrapy Project<\/h2>\n<p><span style=\"font-weight: 400;\">Let&#8217;s begin by setting up our Scrapy project:<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Install Scrapy<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"shell\">pip install scrapy<\/pre>\n<p>Create a new Scrapy project<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"shell\">scrapy startproject decathlonUS_scraper\r\ncd decathlonUS_scraper<\/pre>\n<p>Generate a new spider<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"shell\">scrapy genspider bag_backpacks decathlon.com<\/pre>\n<p><span style=\"font-weight: 400;\">Open the project folder in your code editor (we&#8217;re using VS code). 
You&#8217;ll see a file structure like this:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1034\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/folder.png\" alt=\"\" width=\"428\" height=\"513\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/folder.png 428w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/folder-250x300.png 250w\" sizes=\"auto, (max-width: 428px) 100vw, 428px\" \/><\/p>\n<h2 id=\"identify-element\"><span style=\"font-weight: 400;\">Identify the element from the webpage<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">To efficiently extract data, we need to identify the relevant HTML elements. We&#8217;ll use the <code>response.css()<\/code> method to extract content using CSS selectors.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To inspect HTML elements:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Right-click on the webpage and select &#8220;Inspect&#8221;.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Click the arrow (element picker) icon in the developer tools.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Hover over the elements you wish to extract.<\/span><\/li>\n<\/ol>\n<h2 id=\"category-url\"><span style=\"font-weight: 400;\">Inspect and get the &#8220;Category URL&#8221;<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Let&#8217;s start by identifying the category URLs.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1032\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/category_url_elm.png\" alt=\"\" width=\"494\" height=\"534\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/category_url_elm.png 494w, 
https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/category_url_elm-278x300.png 278w\" sizes=\"auto, (max-width: 494px) 100vw, 494px\" \/>\u00a0 <img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-951\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/category_html.png\" alt=\"\" width=\"468\" height=\"551\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/category_html.png 468w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/category_html-255x300.png 255w\" sizes=\"auto, (max-width: 468px) 100vw, 468px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">We\u2019re going to collect every <code>href<\/code> value.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-910\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/category_html_2.png\" alt=\"\" width=\"534\" height=\"242\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/category_html_2.png 534w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/category_html_2-300x136.png 300w\" sizes=\"auto, (max-width: 534px) 100vw, 534px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Notice, however, that there are three identical tags. We want the values inside the third <code>ul<\/code> tag, which sits inside the second <code>li<\/code> tag. 
It\u2019s a bit confusing, so let&#8217;s test it inside the Scrapy shell first to ensure we&#8217;re selecting the correct elements.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Run the command below in your terminal to open the Scrapy shell, then test the selector inside it:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">scrapy shell \"https:\/\/www.decathlon.com\/collections\/lifestyle-packs\"<\/pre>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">categories = response.css('ul.de-u-listReset ul.de-u-listReset')<\/pre>\n<p><span style=\"font-weight: 400;\">Check the length:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">len(categories)<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-932\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/category_len.png\" alt=\"\" width=\"689\" height=\"67\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/category_len.png 689w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/category_len-300x29.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/category_len-624x61.png 624w\" sizes=\"auto, (max-width: 689px) 100vw, 689px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Check the <code>href<\/code> attributes inside each &#8220;categories&#8221; element:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-929\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/category_response.png\" alt=\"\" width=\"672\" height=\"465\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/category_response.png 672w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/category_response-300x208.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/category_response-624x432.png 624w\" sizes=\"auto, (max-width: 672px) 100vw, 672px\" \/><\/p>\n<p><span 
style=\"font-weight: 400;\">Our desired URLs are in the last &#8220;categories&#8221; element. Let&#8217;s incorporate this into our Scrapy code.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Open the <\/span><strong>bag_backpacks.py<\/strong><span style=\"font-weight: 400;\">\u00a0file:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">import scrapy\r\n\r\nclass BagBackpacksSpider(scrapy.Spider):\r\n    name = \"bag_backpacks\"\r\n    allowed_domains = [\"decathlon.com\"]\r\n    start_urls = [\"https:\/\/www.decathlon.com\/collections\/backpacks-bags\"]\r\n    base_url = \"https:\/\/decathlon.com\"\r\n\r\n\r\n    def start_requests(self):\r\n        for url in self.start_urls:\r\n            yield scrapy.Request(url, callback=self.parse_url_categories)\r\n\r\n\r\n    def parse_url_categories(self, response):\r\n        categories = response.css('ul.de-u-listReset ul.de-u-listReset')\r\n        # The third match (index 2) holds the category links we want\r\n        category_urls = categories[2].css('a::attr(href)').getall()\r\n        for relative_url in category_urls:\r\n            url = self.base_url + relative_url\r\n            yield scrapy.Request(url, callback=self.parse_product_url)\r\n\r\n\r\n    def parse_product_url(self, response):\r\n        print(f\"Parsing product URL: {response.url}\")<\/pre>\n<h2 id=\"product-url\"><span style=\"font-weight: 400;\">Inspect and get the &#8220;Product URL&#8221;<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Now let\u2019s go to <a href=\"https:\/\/www.decathlon.com\/collections\/lifestyle-packs\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/www.decathlon.com\/collections\/lifestyle-packs<\/a> to get the &#8220;Product URL&#8221; for every product that appears on the 
page.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We will verify the selector for this element first.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1037\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/url_elem.png\" alt=\"\" width=\"887\" height=\"748\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/url_elem.png 887w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/url_elem-300x253.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/url_elem-768x648.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/url_elem-624x526.png 624w\" sizes=\"auto, (max-width: 887px) 100vw, 887px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Run the Scrapy shell again. Make sure to <code>quit()<\/code><\/span><span style=\"font-weight: 400;\">\u00a0the previous Scrapy shell first.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">scrapy shell \"https:\/\/www.decathlon.com\/collections\/lifestyle-packs\"<\/pre>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">product_url = response.css('a.js-de-ProductTile-link::attr(href)').getall()<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-945\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/product_url_response.png\" alt=\"\" width=\"737\" height=\"344\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/product_url_response.png 737w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/product_url_response-300x140.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/product_url_response-624x291.png 624w\" sizes=\"auto, (max-width: 737px) 100vw, 737px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Notice that it returns duplicate URLs. 
To remove them, we use <code>set()<\/code>, a built-in Python data structure that keeps only unique values.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">unique_product_urls = list(set(product_url))<\/pre>\n<p><span style=\"font-weight: 400;\">Add this code block to our <\/span><strong>bag_backpacks.py<\/strong><span style=\"font-weight: 400;\"> file.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">    def parse_product_url(self, response):\r\n        product_urls = response.css(\r\n            'a.js-de-ProductTile-link::attr(href)').getall()\r\n        # A set drops the duplicate links found on the page\r\n        unique_product_urls = list(set(product_urls))\r\n        for product_url in unique_product_urls:\r\n            url = self.base_url + product_url\r\n            # Request each product page and parse it in parse_product\r\n            yield scrapy.Request(url, callback=self.parse_product)<\/pre>\n<h2 id=\"product-page\"><span style=\"font-weight: 400;\">Inspect and get the items inside the product page<\/span><\/h2>\n<h3><span style=\"font-weight: 400;\">Category<\/span><\/h3>\n<p><span style=\"font-weight: 400;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1016\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/nav_breadcrumb_element-1.png\" alt=\"\" width=\"518\" height=\"188\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/nav_breadcrumb_element-1.png 518w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/nav_breadcrumb_element-1-300x109.png 300w\" sizes=\"auto, (max-width: 518px) 100vw, 518px\" \/>\u00a0 \u00a0 <img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1017\" 
src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/nav_tag.png\" alt=\"\" width=\"704\" height=\"387\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/nav_tag.png 704w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/nav_tag-300x165.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/nav_tag-624x343.png 624w\" sizes=\"auto, (max-width: 704px) 100vw, 704px\" \/><\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">breadcrumb_links = response.css('nav.breadcrumb a::text').getall()<\/pre>\n<p><span style=\"font-weight: 400;\">Then join the breadcrumb text with <code>' \/ '<\/code> as the separator:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">category = ' \/ '.join(breadcrumb_links).strip()<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-921\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/breadcrumb.png\" alt=\"\" width=\"571\" height=\"73\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/breadcrumb.png 571w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/breadcrumb-300x38.png 300w\" sizes=\"auto, (max-width: 571px) 100vw, 571px\" \/><\/p>\n<h3><span style=\"font-weight: 400;\">Brand<\/span><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1008\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/brand_element-1.png\" alt=\"\" width=\"685\" height=\"206\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/brand_element-1.png 685w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/brand_element-1-300x90.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/brand_element-1-624x188.png 624w\" sizes=\"auto, (max-width: 685px) 100vw, 685px\" \/>\u00a0 \u00a0 \u00a0 <img loading=\"lazy\" 
decoding=\"async\" class=\"alignnone size-full wp-image-1010\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/brand_tag.png\" alt=\"\" width=\"662\" height=\"121\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/brand_tag.png 662w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/brand_tag-300x55.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/brand_tag-624x114.png 624w\" sizes=\"auto, (max-width: 662px) 100vw, 662px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Let&#8217;s verify this element inside the Scrapy shell again:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">scrapy shell \"https:\/\/www.decathlon.com\/collections\/lifestyle-packs\/products\/quechua-nh-escape-500-16-l-hiking-backpack-334520\"<\/pre>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">brand = response.css('span.de-u-textGrow2.de-u-md-textGrow3.de-u-lg-textGrow4.de-u-textBold::text').get().strip()<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-912\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/brand_response.png\" alt=\"\" width=\"734\" height=\"72\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/brand_response.png 734w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/brand_response-300x29.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/brand_response-624x61.png 624w\" sizes=\"auto, (max-width: 734px) 100vw, 734px\" \/><\/p>\n<h3><span style=\"font-weight: 400;\">Product Name<\/span><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1012\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/product_name_element-1.png\" alt=\"\" width=\"661\" height=\"286\" title=\"\" 
srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/product_name_element-1.png 661w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/product_name_element-1-300x130.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/product_name_element-1-624x270.png 624w\" sizes=\"auto, (max-width: 661px) 100vw, 661px\" \/>\u00a0 \u00a0 <img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1013\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/product_name_tag.png\" alt=\"\" width=\"667\" height=\"123\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/product_name_tag.png 667w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/product_name_tag-300x55.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/product_name_tag-624x115.png 624w\" sizes=\"auto, (max-width: 667px) 100vw, 667px\" \/><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">product_name = response.css('h1::text').get().strip()<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-939\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/product_name_response.png\" alt=\"\" width=\"467\" height=\"51\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/product_name_response.png 467w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/product_name_response-300x33.png 300w\" sizes=\"auto, (max-width: 467px) 100vw, 467px\" \/><\/p>\n<h3><span style=\"font-weight: 400;\">Star rating<\/span><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1020\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/star_rating_element-1.png\" alt=\"\" width=\"564\" height=\"200\" title=\"\" 
srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/star_rating_element-1.png 564w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/star_rating_element-1-300x106.png 300w\" sizes=\"auto, (max-width: 564px) 100vw, 564px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">The page doesn&#8217;t display the star rating as a number, but if we inspect the HTML we can see that the value is inside the <code>span class=\"de-u-hiddenVisually\"<\/code> tag.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1022\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/star_rating_tag.png\" alt=\"\" width=\"616\" height=\"192\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/star_rating_tag.png 616w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/star_rating_tag-300x94.png 300w\" sizes=\"auto, (max-width: 616px) 100vw, 616px\" \/><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">star_rating = response.css('span.de-StarRating.de-u-spaceRight06 span.de-u-hiddenVisually::text').get().strip()<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-923\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/star_rating_response.png\" alt=\"\" width=\"736\" height=\"71\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/star_rating_response.png 736w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/star_rating_response-300x29.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/star_rating_response-624x60.png 624w\" sizes=\"auto, (max-width: 736px) 100vw, 736px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">We need to clean the text a little by removing <em>&#8216;(Average rating: &#8217;<\/em> and <em>&#8216; out of 5 stars,&#8217;<\/em>.<\/span><\/p>\n<p><span style=\"font-weight: 
400;\">We&#8217;ll use Python&#8217;s built-in string method <code>replace()<\/code>.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">star_rating = response.css('span.de-StarRating.de-u-spaceRight06 span.de-u-hiddenVisually::text').get().strip().replace('(Average rating: ', '').replace(' out of 5 stars,', '')<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-942\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/star_rating_response-2.png\" alt=\"\" width=\"736\" height=\"73\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/star_rating_response-2.png 736w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/star_rating_response-2-300x30.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/star_rating_response-2-624x62.png 624w\" sizes=\"auto, (max-width: 736px) 100vw, 736px\" \/><\/p>\n<h3><span style=\"font-weight: 400;\">Number of reviews<\/span><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1023\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/review_element-1.png\" alt=\"\" width=\"543\" height=\"220\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/review_element-1.png 543w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/review_element-1-300x122.png 300w\" sizes=\"auto, (max-width: 543px) 100vw, 543px\" \/><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">reviews = response.css('span.de-u-textMedium.de-u-textSelectNone.de-u-textBlue::text').get().strip()<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-941\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/review_response.png\" alt=\"\" width=\"712\" height=\"76\" title=\"\" 
srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/review_response.png 712w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/review_response-300x32.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/review_response-624x67.png 624w\" sizes=\"auto, (max-width: 712px) 100vw, 712px\" \/><\/p>\n<h3><span style=\"font-weight: 400;\">Description<\/span><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1025\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/desc_element.png\" alt=\"\" width=\"940\" height=\"264\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/desc_element.png 940w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/desc_element-300x84.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/desc_element-768x216.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/desc_element-624x175.png 624w\" sizes=\"auto, (max-width: 940px) 100vw, 940px\" \/><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">description = response.css('ul.about-this-item li::text').getall()<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-931\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/description_response.png\" alt=\"\" width=\"729\" height=\"88\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/description_response.png 729w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/description_response-300x36.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/description_response-624x75.png 624w\" sizes=\"auto, (max-width: 729px) 100vw, 729px\" \/><\/p>\n<h2 id=\"variable-elements\"><span style=\"font-weight: 400;\">Inspect and get the variable elements<\/span><\/h2>\n<p><span style=\"font-weight: 
400;\">On a product page, the following elements change when we select a different variation. For example, when we click on a color, the image and (sometimes) the price change as well.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Variations (If applicable)<\/span><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1026\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/color_element.png\" alt=\"\" width=\"579\" height=\"242\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/color_element.png 579w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/color_element-300x125.png 300w\" sizes=\"auto, (max-width: 579px) 100vw, 579px\" \/><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">color = response.css('span.de-u-textDarkGray.de-u-textMedium.js-de-ColorInfo::text').get().strip()<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-933\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/color_response.png\" alt=\"\" width=\"672\" height=\"71\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/color_response.png 672w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/color_response-300x32.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/color_response-624x66.png 624w\" sizes=\"auto, (max-width: 672px) 100vw, 672px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Running this code returns the text &#8216;Select a color&#8217; instead of the &#8216;Yellow Ochre&#8217; we expected.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Let&#8217;s view the page source: right-click and choose &#8216;View page source&#8217;, or press Ctrl + U.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1028\" 
src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/select_color_elm.png\" alt=\"\" width=\"659\" height=\"147\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/select_color_elm.png 659w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/select_color_elm-300x67.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/select_color_elm-624x139.png 624w\" sizes=\"auto, (max-width: 659px) 100vw, 659px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">The color label appears to be rendered by JavaScript, since the value in the page source is also &#8216;Select a color&#8217;.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Let&#8217;s scroll further to see whether the color appears anywhere else on the page.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-944\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/product_variaton.png\" alt=\"\" width=\"770\" height=\"286\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/product_variaton.png 770w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/product_variaton-300x111.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/product_variaton-768x285.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/product_variaton-624x232.png 624w\" sizes=\"auto, (max-width: 770px) 100vw, 770px\" \/><br \/>\n<span style=\"font-weight: 400;\">Notice the <code>&lt;select&gt;<\/code> tag with <code>id=\"productSelect\"<\/code>, whose options contain the <strong>color<\/strong>, <strong>size<\/strong>, <strong>product ID<\/strong> and <strong>regular price<\/strong>.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">details = response.css('select#productSelect option::text').getall()<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-955\" 
src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/variation_response.png\" alt=\"\" width=\"657\" height=\"77\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/variation_response.png 657w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/variation_response-300x35.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/variation_response-624x73.png 624w\" sizes=\"auto, (max-width: 657px) 100vw, 657px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">We will use string manipulation techniques to extract the values from the result that we got above.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">String Manipulation<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">We will extract the values for colors, sizes, product_ids and regular_prices.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To test this, we will create a new file named <\/span><strong>string_manipulation.py<\/strong><span style=\"font-weight: 400;\"> and run it to see the result.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">details = [\r\n\u00a0 \u00a0 'Yellow Ochre \/ 16 L \/ 8844302 - $39.99 USD',\r\n\u00a0 \u00a0 'Carbon Gray \/ 16 L \/ 8649496 - $39.99 USD',\r\n\u00a0 \u00a0 '\\n Whale Gray \/ 16 L \/ 8649499 - Sold out\\n '\r\n]\r\n\r\ncolors = []\r\nsizes = []\r\nproduct_ids = []\r\nprices = []\r\n\r\nfor detail in details:\r\n\u00a0 \u00a0 parts = detail.split(' \/ ')\r\n\u00a0 \u00a0 color = parts[0].strip()\r\n\u00a0 \u00a0 size = parts[1]\r\n\r\n\u00a0 \u00a0 infos = parts[2].split(' - ')\r\n\u00a0 \u00a0 product_id = infos[0]\r\n\u00a0 \u00a0 price = infos[1].replace('USD', '').strip()\r\n\r\n\u00a0 \u00a0 # Check if the price is not \"Sold out\"\r\n\u00a0 \u00a0 if price != \"Sold out\":\r\n\u00a0 \u00a0 \u00a0 \u00a0 # Append the extracted values to their respective lists\r\n\u00a0 \u00a0 \u00a0 \u00a0 colors.append(color)\r\n\u00a0 \u00a0 \u00a0 \u00a0 
sizes.append(size)\r\n\u00a0 \u00a0 \u00a0 \u00a0 product_ids.append(product_id)\r\n\u00a0 \u00a0 \u00a0 \u00a0 prices.append(price)\r\n\u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 continue\u00a0 # Skip the \"Sold out\" case entirely\r\n\r\nprint(\"Colors:\", colors)\r\nprint(\"Sizes:\", sizes)\r\nprint(\"Product IDs:\", product_ids)\r\nprint(\"Prices:\", prices)<\/pre>\n<p><span style=\"font-weight: 400;\">The result from the code above<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1033\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/string_manipulation.png\" alt=\"\" width=\"615\" height=\"129\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/string_manipulation.png 615w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/string_manipulation-300x63.png 300w\" sizes=\"auto, (max-width: 615px) 100vw, 615px\" \/><\/p>\n<h3><span style=\"font-weight: 400;\">Previous Price (If applicable)<\/span><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1027\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/prev_price_element.png\" alt=\"\" width=\"477\" height=\"172\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/prev_price_element.png 477w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/prev_price_element-300x108.png 300w\" sizes=\"auto, (max-width: 477px) 100vw, 477px\" \/>\u00a0 \u00a0 <img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1029\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/prev_price_tag.png\" alt=\"\" width=\"571\" height=\"175\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/prev_price_tag.png 571w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/prev_price_tag-300x92.png 300w\" sizes=\"auto, 
(max-width: 571px) 100vw, 571px\" \/><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">previous_price = response.css('del.js-de-CrossedOutPrice span.js-de-PriceAmount::text').get().strip()<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-937\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/cross_price_response.png\" alt=\"\" width=\"670\" height=\"68\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/cross_price_response.png 670w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/cross_price_response-300x30.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/cross_price_response-624x63.png 624w\" sizes=\"auto, (max-width: 670px) 100vw, 670px\" \/><\/p>\n<h3><span style=\"font-weight: 400;\">Image URL<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">On the product page, we can see one big image with smaller images at the side. 
We will call them the feature image and the carousel images.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1030\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/img_carousel_element.png\" alt=\"\" width=\"943\" height=\"859\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/img_carousel_element.png 943w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/img_carousel_element-300x273.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/img_carousel_element-768x700.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/img_carousel_element-624x568.png 624w\" sizes=\"auto, (max-width: 943px) 100vw, 943px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">The feature image is the big one that appears on the page.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">feature_img = response.css('img.de-CarouselFeature-image::attr(data-src)').getall()<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-953\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/feature_img_response.png\" alt=\"\" width=\"606\" height=\"88\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/feature_img_response.png 606w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/feature_img_response-300x44.png 300w\" sizes=\"auto, (max-width: 606px) 100vw, 606px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">The carousel images are the smaller ones that appear at the side.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1031\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/img_ca_elm.png\" alt=\"\" width=\"496\" height=\"337\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/img_ca_elm.png 496w, 
https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/img_ca_elm-300x204.png 300w\" sizes=\"auto, (max-width: 496px) 100vw, 496px\" \/><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">carousel_img = response.css('div.de-CarouselThumbnail-slide img::attr(data-src)').getall()<\/pre>\n<h3><span style=\"font-weight: 400;\">Product URL<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Getting the URL is the easiest since we just need to get the URL that we are currently inspecting. Simply run:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">url = response.url<\/pre>\n<h2 id=\"complete-code\"><span style=\"font-weight: 400;\">Put the code together<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Now that we identify all the elements needed, it&#8217;s time to put all the code blocks together inside the <\/span><strong>bag_backpacks.py<\/strong><span style=\"font-weight: 400;\"> file.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">import scrapy\r\n\r\nclass BagBackpacksSpider(scrapy.Spider):\r\n\u00a0 \u00a0 name = \"bag_backpacks\"\r\n\u00a0 \u00a0 allowed_domains = [\"decathlon.com\"]\r\n\u00a0 \u00a0 start_urls = [\"https:\/\/www.decathlon.com\/collections\/backpacks-bags\"]\r\n\u00a0 \u00a0 base_url = \"https:\/\/decathlon.com\"\r\n\r\n\u00a0 \u00a0 def start_requests(self):\r\n\u00a0 \u00a0 \u00a0 \u00a0 for url in self.start_urls:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 yield scrapy.Request(url, callback=self.parse_url_categories)\r\n\r\n\u00a0 \u00a0 def parse_url_categories(self, response):\r\n\u00a0 \u00a0 \u00a0 \u00a0 categories = response.css('ul.de-u-listReset ul.de-u-listReset')\r\n\u00a0 \u00a0 \u00a0 \u00a0 category_urls = categories[2].css('a::attr(href)').getall()\r\n\u00a0 \u00a0 \u00a0 \u00a0 for relative_url in category_urls:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 url = self.base_url + relative_url\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 
\u00a0 yield scrapy.Request(url, callback=self.parse_product_url)\r\n\r\n\u00a0 \u00a0 def parse_product_url(self, response):\r\n\u00a0 \u00a0 \u00a0 \u00a0 product_urls = response.css(\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'a.js-de-ProductTile-link::attr(href)').getall()\r\n\u00a0 \u00a0 \u00a0 \u00a0 unique_product_urls = list(set(product_urls))\r\n\u00a0 \u00a0 \u00a0 \u00a0 for product_url in unique_product_urls:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 url = self.base_url + product_url\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 yield scrapy.Request(url, callback=self.parse_product)\r\n\r\n\u00a0 \u00a0 def parse_product(self, response):\r\n\u00a0 \u00a0 \u00a0 \u00a0 breadcrumb_links = response.css('nav.breadcrumb a::text').getall()\r\n\u00a0 \u00a0 \u00a0 \u00a0 category = ' \/ '.join(breadcrumb_links).strip()\r\n\u00a0 \u00a0 \u00a0 \u00a0 brand = response.css(\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'span.de-u-textGrow2.de-u-md-textGrow3.de-u-lg-textGrow4.de-u-textBold::text').get().strip()\r\n\u00a0 \u00a0 \u00a0 \u00a0 product_name = response.css('h1::text').get().strip()\r\n\u00a0 \u00a0 \u00a0 \u00a0 star_rating = response.css('span.de-StarRating.de-u-spaceRight06 span.de-u-hiddenVisually::text').get(\r\n\u00a0 \u00a0 \u00a0 \u00a0 ).strip().replace('(Average rating: ', '').replace(' out of 5 stars,', '')\r\n\u00a0 \u00a0 \u00a0 \u00a0 reviews = response.css(\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'span.de-u-textMedium.de-u-textSelectNone.de-u-textBlue::text').get().strip()\r\n\u00a0 \u00a0 \u00a0 \u00a0 description = response.css('ul.about-this-item li::text').getall()\r\n\u00a0 \u00a0 \u00a0 \u00a0 # Join the description bullet points with newlines\r\n\u00a0 \u00a0 \u00a0 \u00a0 all_description = '\\n'.join(description)\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 # Items that change depending on the variations\r\n\u00a0 \u00a0 \u00a0 \u00a0 details = response.css('select#productSelect option::text').getall()\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 colors = []\r\n\u00a0 \u00a0 \u00a0 \u00a0 sizes = []\r\n\u00a0 \u00a0 \u00a0 
\u00a0 product_ids = []\r\n\u00a0 \u00a0 \u00a0 \u00a0 regular_prices = []\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 for detail in details:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 parts = detail.split(' \/ ')\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 color = parts[0].strip()\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 size = parts[1]\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 infos = parts[2].split(' - ')\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 product_id = infos[0]\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 price = infos[1].replace('USD', '').strip()\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 # Check if the price is not \"Sold out\"\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 if price != \"Sold out\":\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 # Append the extracted values to their respective lists\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 colors.append(color)\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 sizes.append(size)\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 product_ids.append(product_id)\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 regular_prices.append(price)\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 continue\u00a0 # Skip the \"Sold out\" case entirely\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 # Not every product has a crossed-out price, so default to ''\r\n\u00a0 \u00a0 \u00a0 \u00a0 # instead of calling strip() on None\r\n\u00a0 \u00a0 \u00a0 \u00a0 prev_price = response.css(\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'del.js-de-CrossedOutPrice span.js-de-PriceAmount::text').get(default='').strip()\r\n\u00a0 \u00a0 \u00a0 \u00a0 if prev_price:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 previous_price = prev_price\r\n\u00a0 \u00a0 \u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 previous_price = ''\r\n\u00a0 \u00a0 \u00a0 \u00a0 feature_imgs = response.css(\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'img.de-CarouselFeature-image::attr(data-src)').getall()\r\n\u00a0 \u00a0 \u00a0 \u00a0 feature_img = [f'https:{img}' for img in feature_imgs]\r\n\u00a0 \u00a0 
\u00a0 \u00a0 carousel_imgs = response.css(\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'div.de-CarouselThumbnail-slide img::attr(data-src)').getall()\r\n\u00a0 \u00a0 \u00a0 \u00a0 carousel_img = [f'https:{img}' for img in carousel_imgs]\r\n\u00a0 \u00a0 \u00a0 \u00a0 url = response.url\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 item = {\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'Category': category,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'Brand': brand,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'Product Name': product_name,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'Star Rating': star_rating,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'Number of reviews': reviews,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'Description': all_description,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'Product ID': ', '.join(product_ids),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'Color': ', '.join(colors),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'Size': ', '.join(sizes),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'Regular Price': ', '.join(regular_prices),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'Previous Price': previous_price,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'Feature Image URLs': ', '.join(feature_img),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'Carousel Image URLs': ', '.join(carousel_img),\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 'URL': url\r\n\u00a0 \u00a0 \u00a0 \u00a0 }\r\n\r\n\u00a0 \u00a0 \u00a0 \u00a0 yield item<\/pre>\n<h3><span style=\"font-weight: 400;\">Modify the <\/span><strong>settings.py<\/strong><\/h3>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">BOT_NAME = \"decathlonUS_scraper\"\r\n\r\nSPIDER_MODULES = [\"decathlonUS_scraper.spiders\"]\r\nNEWSPIDER_MODULE = \"decathlonUS_scraper.spiders\"\r\n\r\nROBOTSTXT_OBEY = True\r\n\r\nCONCURRENT_REQUESTS = 8\r\n\r\nSPIDER_MIDDLEWARES = {\r\n\u00a0 \u00a0 \"decathlonUS_scraper.middlewares.DecathlonusScraperSpiderMiddleware\": 
543,\r\n}\r\n\r\nDOWNLOADER_MIDDLEWARES = {\r\n\u00a0 \u00a0 \"decathlonUS_scraper.middlewares.DecathlonusScraperDownloaderMiddleware\": 543,\r\n\u00a0 \u00a0 'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,\r\n\u00a0 \u00a0 'rotating_proxies.middlewares.BanDetectionMiddleware': 620,\r\n}\r\n\r\nITEM_PIPELINES = {\r\n\u00a0 \u00a0 \"decathlonUS_scraper.pipelines.DecathlonusScraperPipeline\": 300,\r\n}\r\n\r\nREQUEST_FINGERPRINTER_IMPLEMENTATION = \"2.7\"\r\nTWISTED_REACTOR = \"twisted.internet.asyncioreactor.AsyncioSelectorReactor\"\r\nFEED_EXPORT_ENCODING = \"utf-8\"\r\n\r\nFEEDS = {\r\n\u00a0 \u00a0 'products_details.csv': {\r\n\u00a0 \u00a0 \u00a0 \u00a0 'format': 'csv',\r\n\u00a0 \u00a0 \u00a0 \u00a0 'overwrite': True,\r\n\u00a0 \u00a0 },\r\n}<\/pre>\n<p><span style=\"font-weight: 400;\">The default value of <strong>CONCURRENT_REQUESTS<\/strong> is 16, but we set it to 8, so Scrapy processes at most 8 requests at the same time.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The default value for <strong>DOWNLOAD_DELAY<\/strong> is 0. Setting it to 2 (add <code>DOWNLOAD_DELAY = 2<\/code> to <strong>settings.py<\/strong>; it does not appear in the listing above) makes Scrapy wait roughly 2 seconds between consecutive requests to the same domain. This helps to mimic human browsing behavior and reduces the risk of being flagged as a bot.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The <strong>FEEDS<\/strong> setting allows us to define where and how our scraped data will be stored. 
It&#8217;s a dictionary where the keys are the output file names (or storage URIs) and the values are dictionaries containing options that specify how the data should be formatted and handled.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In this case, we save our data as a CSV file named <\/span><strong>products_details.csv<\/strong><span style=\"font-weight: 400;\">.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Modify the <\/span><strong>items.py<\/strong><\/h3>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">import scrapy\r\n\r\nclass DecathlonusScraperItem(scrapy.Item):\r\n\u00a0 \u00a0 category = scrapy.Field()\r\n\u00a0 \u00a0 brand = scrapy.Field()\r\n\u00a0 \u00a0 product_name = scrapy.Field()\r\n\u00a0 \u00a0 star_rating = scrapy.Field()\r\n\u00a0 \u00a0 reviews = scrapy.Field()\r\n\u00a0 \u00a0 all_description = scrapy.Field()\r\n\u00a0 \u00a0 product_ids = scrapy.Field()\r\n\u00a0 \u00a0 colors = scrapy.Field()\r\n\u00a0 \u00a0 sizes = scrapy.Field()\r\n\u00a0 \u00a0 regular_prices = scrapy.Field()\r\n\u00a0 \u00a0 previous_price = scrapy.Field()\r\n\u00a0 \u00a0 feature_img = scrapy.Field()\r\n\u00a0 \u00a0 carousel_img = scrapy.Field()\r\n\u00a0 \u00a0 url = scrapy.Field()<\/pre>\n<h3><span style=\"font-weight: 400;\">Run the code<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Run the spider from the project directory:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"shell\">scrapy crawl bag_backpacks<\/pre>\n<p><span style=\"font-weight: 400;\">This works fine on the first run. 
But we need to keep in mind that our IP address might be blocked later on if we send many requests.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">The results<\/span><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-934\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/csv_response.png\" alt=\"\" width=\"717\" height=\"58\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/csv_response.png 717w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/csv_response-300x24.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/csv_response-624x50.png 624w\" sizes=\"auto, (max-width: 717px) 100vw, 717px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">We will find the <\/span><strong>products_details.csv<\/strong><span style=\"font-weight: 400;\"> is created inside our directory that looks like this:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-963 size-full\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/csv_snapshot.png\" alt=\"\" width=\"1410\" height=\"552\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/csv_snapshot.png 1410w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/csv_snapshot-300x117.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/csv_snapshot-1024x401.png 1024w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/csv_snapshot-768x301.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/csv_snapshot-624x244.png 624w\" sizes=\"auto, (max-width: 1410px) 100vw, 1410px\" \/><\/p>\n<h2 id=\"setup-proxy\"><span style=\"font-weight: 400;\">Setting up Proxy Rotation (Optional)<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Why do we need proxy rotation? 
When scraping websites, especially at scale, using a single IP or proxy increases the risk of being blocked by the site. Many websites monitor traffic for unusual patterns, and multiple requests from the same IP can trigger anti-scraping measures. Proxy rotation helps to distribute requests across various IP addresses, making it harder for websites to detect and block your scraper. This not only ensures uninterrupted scraping but also keeps your scraper running smoothly without raising red flags.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Get the proxy-list<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Go to <\/span><a href=\"https:\/\/rayobyte.com\/\"><span style=\"font-weight: 400;\">https:\/\/rayobyte.com\/<\/span><\/a><span style=\"font-weight: 400;\"> and click on &#8220;Start My Trial&#8221; \u2013 You&#8217;ll get a 50 MB trial and NO credit card required!<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-960\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/rayobyte_page-1024x544.png\" alt=\"\" width=\"640\" height=\"340\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/rayobyte_page-1024x544.png 1024w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/rayobyte_page-300x159.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/rayobyte_page-768x408.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/rayobyte_page-1536x816.png 1536w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/rayobyte_page-624x332.png 624w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/rayobyte_page.png 1920w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Click on &#8220;Sign Up&#8221; on Rotating Proxy Dashboard<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-961\" 
src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/proxy_pricing-1024x544.png\" alt=\"\" width=\"640\" height=\"340\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/proxy_pricing-1024x544.png 1024w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/proxy_pricing-300x159.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/proxy_pricing-768x408.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/proxy_pricing-1536x816.png 1536w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/proxy_pricing-624x332.png 624w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/proxy_pricing.png 1920w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Enter your email and password. You\u2019ll get an account verification link in your email. Verify your account, then you\u2019ll get access to the &#8220;Residential Dashboard&#8221;<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1036\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/generate_proxy.png\" alt=\"\" width=\"514\" height=\"883\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/generate_proxy.png 514w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/generate_proxy-175x300.png 175w\" sizes=\"auto, (max-width: 514px) 100vw, 514px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Click on the &#8220;Proxy List Generator&#8221;.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Inside the dashboard you can configure which location etc..<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-964\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/proxy_page-1024x347.png\" alt=\"\" width=\"640\" height=\"217\" title=\"\" 
srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/proxy_page-1024x347.png 1024w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/proxy_page-300x102.png 300w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/proxy_page-768x260.png 768w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/proxy_page-624x211.png 624w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/proxy_page.png 1376w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">For this tutorial, I generated just 10 IPs. Make sure to choose the <\/span><strong>username:password@hostname:port<\/strong><span style=\"font-weight: 400;\"> format, then click the download icon.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Save your proxy list here:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1035\" src=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/proxy_location.png\" alt=\"\" width=\"441\" height=\"594\" title=\"\" srcset=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/proxy_location.png 441w, https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/proxy_location-223x300.png 223w\" sizes=\"auto, (max-width: 441px) 100vw, 441px\" \/><\/p>\n<h3><span style=\"font-weight: 400;\">Install Required Package<\/span><\/h3>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"shell\">pip install scrapy-rotating-proxies<\/pre>\n<h3><span style=\"font-weight: 400;\">Configure the Settings<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Next, we need to configure <\/span><strong>settings.py<\/strong><span style=\"font-weight: 400;\"> to use the rotating proxies. We need to specify the path to the text file containing our proxies. 
This file should have one proxy per line in the format <\/span><strong>http:\/\/username:password@proxy_ip:port<\/strong><\/p>\n<p><span style=\"font-weight: 400;\">Note that the list we saved earlier uses the format <\/span><strong>username:password@proxy_ip:port<\/strong><span style=\"font-weight: 400;\">, so we need to prepend <\/span><span style=\"font-weight: 400;\">http:\/\/<\/span><span style=\"font-weight: 400;\"> to each entry using the function below.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">import os\r\n\r\n# Scrapy settings for decathlonUS_scraper project\r\n# Path to the downloaded proxy file. It is deliberately not named\r\n# ROTATING_PROXY_LIST_PATH: scrapy-rotating-proxies reads that setting itself,\r\n# and it takes precedence over ROTATING_PROXY_LIST.\r\nPROXY_LIST_FILE = 'proxy-list.txt'\r\n\r\n# Function to read and format the proxy list\r\ndef get_proxies():\r\n\u00a0 \u00a0 proxies = []\r\n\u00a0 \u00a0 if os.path.exists(PROXY_LIST_FILE):\r\n\u00a0 \u00a0 \u00a0 \u00a0 with open(PROXY_LIST_FILE, 'r') as file:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 for line in file:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 proxy = line.strip()\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 if proxy:\u00a0 # Ensure the line is not empty\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 # Add \"http:\/\/\" to each proxy\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 proxies.append(f\"http:\/\/{proxy}\")\r\n\u00a0 \u00a0 return proxies\r\n\r\n\r\n# Set the formatted proxy list as a Scrapy setting\r\nROTATING_PROXY_LIST = get_proxies()<\/pre>\n<p><span style=\"font-weight: 400;\">Include these middlewares for rotating proxies.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">DOWNLOADER_MIDDLEWARES = {\r\n\u00a0 \u00a0 'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,\r\n\u00a0 \u00a0 'rotating_proxies.middlewares.BanDetectionMiddleware': 620,\r\n}<\/pre>\n<p><span style=\"font-weight: 400;\">The <code>RotatingProxyMiddleware<\/code> and <code>BanDetectionMiddleware<\/code> 
are provided by the <\/span><span style=\"font-weight: 400;\">scrapy-rotating-proxies<\/span><span style=\"font-weight: 400;\">\u00a0package, so we don\u2019t need to write custom classes for them.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Run the code<\/span><\/h3>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"shell\">scrapy crawl bag_backpacks<\/pre>\n<p><span style=\"font-weight: 400;\">You can download the source code here: <\/span><a href=\"https:\/\/github.com\/ainacodes\/decathlonUS_scraper\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">decathlonUS_scraper<\/span><\/a><\/p>\n<h2 id=\"conclusion\"><span style=\"font-weight: 400;\">Conclusion<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">This tutorial has demonstrated how to develop a comprehensive Shopify scraper using Python and the Scrapy framework. By following this guide, you&#8217;ve created a powerful tool for extracting valuable product data from Shopify-based e-commerce platforms.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Key takeaways include:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Setting up a Scrapy project efficiently<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Navigating complex website structures<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Extracting data from dynamic content<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Implementing ethical scraping practices<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Processing and storing scraped data effectively<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">As you apply these techniques to your own projects, remember to always respect website terms of service and implement responsible scraping practices. 
The insights gained from this data can drive informed business decisions, market analysis, and product strategy in the competitive e-commerce landscape.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Learn how to create a Shopify scraper using Python to extract product data, prices, and more. Full tutorial with source code. source code available here:&hellip;<\/p>\n","protected":false},"author":25,"featured_media":968,"comment_status":"open","ping_status":"closed","template":"","meta":{"rank_math_lock_modified_date":false},"categories":[],"class_list":["post-905","scraping_project","type-scraping_project","status-publish","has-post-thumbnail","hentry"],"_links":{"self":[{"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/scraping_project\/905","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/scraping_project"}],"about":[{"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/types\/scraping_project"}],"author":[{"embeddable":true,"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/users\/25"}],"replies":[{"embeddable":true,"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/comments?post=905"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/media\/968"}],"wp:attachment":[{"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/media?parent=905"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/categories?post=905"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}