Scraping Data from Gmarket.co.kr and Storing It in MySQL, MongoDB, and PostgreSQL

Comprehensive Guide to Scraping Data from Gmarket.co.kr and Efficiently Storing It in MySQL, MongoDB, and PostgreSQL

Scraping data from e-commerce websites like Gmarket.co.kr can yield valuable insights into market trends, consumer preferences, and competitive pricing. Storing that data efficiently means understanding the trade-offs between database systems such as MySQL, MongoDB, and PostgreSQL. This guide walks through scraping product data from Gmarket.co.kr and storing it in each of these databases, so you can play to the strengths of whichever system fits your needs.

Web scraping is the automated extraction of data from websites, and it must be done ethically and legally: review Gmarket.co.kr's terms of service before scraping to ensure compliance. For static pages, Python tools such as BeautifulSoup or Scrapy parse HTML and XML documents, making it easy to navigate a site's structure and pull out product names, prices, and descriptions. Because Gmarket renders much of its content with JavaScript, the worked examples later in this guide use Selenium, which drives a real browser.
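
As a quick illustration of the static-page approach, here is a minimal BeautifulSoup sketch; the HTML snippet and its class names are hypothetical stand-ins rather than Gmarket's real markup.

from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a fetched product listing
html = """
<div class="item">
    <span class="title">Sample Product</span>
    <span class="price">19,900</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.select_one(".title").get_text()  # "Sample Product"
price = soup.select_one(".price").get_text()  # "19,900"
print(title, price)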

After scraping, the next step is storage. MySQL, a relational database management system, is a popular choice for its robustness and ease of use. Storing data in MySQL starts with a schema: create tables whose columns match the shape of the scraped information, then insert the rows with SQL so the data stays organized and easy to query. Section 1 below walks through a complete example.

MongoDB, in contrast, is a NoSQL database that stores data as JSON-like documents, which helps when the data types are diverse or the structure is not fixed. From Python, the PyMongo library connects to a MongoDB instance and inserts the scraped records as documents, as shown in section 2. This approach scales well and adapts easily when the data model evolves over time.

PostgreSQL, another relational database, adds advanced features such as rich support for complex queries and strong data-integrity guarantees, making it well suited to applications that need transactional integrity and complex data relationships. From Python, the psycopg2 library establishes a connection and executes SQL to create tables and insert data, as shown in section 3. PostgreSQL's JSON and JSONB column types also give it some flexibility with semi-structured data, bridging the gap between traditional relational databases and NoSQL systems; a brief sketch of that follows.
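
Here is a minimal sketch of that JSON support; the table and column names (product_details, attributes) are illustrative assumptions, separate from the worked examples below.

import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect(
    dbname="gmarket_db",
    user="postgres",
    password="yourpassword",
    host="localhost"
)
cursor = conn.cursor()

# Illustrative table mixing relational columns with a JSONB column
cursor.execute("""
    CREATE TABLE IF NOT EXISTS product_details (
        id SERIAL PRIMARY KEY,
        title TEXT,
        attributes JSONB
    )
""")

# Semi-structured attributes stored alongside the structured title
attributes = {"brand": "SampleBrand", "color": "black", "options": ["S", "M"]}
cursor.execute(
    "INSERT INTO product_details (title, attributes) VALUES (%s, %s)",
    ("Sample Product", Json(attributes))
)
conn.commit()
conn.close()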

In conclusion, storing scraped Gmarket.co.kr data comes down to matching each database's strengths to your requirements: the structured approach of MySQL, the schema flexibility of MongoDB, or the advanced features of PostgreSQL. Whichever you choose, the goal is the same: data stored efficiently and readily available for analysis. And throughout, adhere to ethical guidelines and legal requirements so your scraping is both effective and responsible.

1. Scraping Gmarket and Storing Data in MySQL

from selenium import webdriver
from selenium.webdriver.common.by import By
import mysql.connector

# Set up Selenium WebDriver
driver = webdriver.Chrome()
driver.get("https://www.gmarket.co.kr/")
driver.implicitly_wait(10)  # allow JavaScript-rendered content to load

# Scrape product titles and prices (these CSS selectors reflect Gmarket's
# markup at the time of writing and may change if the site is redesigned)
products = driver.find_elements(By.CSS_SELECTOR, ".box__item-title")
prices = driver.find_elements(By.CSS_SELECTOR, ".box__price-seller strong")

data = []
for product, price in zip(products[:10], prices[:10]):  # Limit to 10 products
    data.append((product.text, int(price.text.replace(",", "").replace("₩", ""))))  # normalize price to int

driver.quit()

# Connect to MySQL
conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password="yourpassword",
    database="gmarket_db"
)
cursor = conn.cursor()

# Create table if not exists
cursor.execute("""
    CREATE TABLE IF NOT EXISTS products (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        price INT
    )
""")

# Insert data into MySQL
cursor.executemany("INSERT INTO products (title, price) VALUES (%s, %s)", data)
conn.commit()
conn.close()

MySQL Query to Retrieve Data

SELECT * FROM products ORDER BY price DESC;
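
To run the same retrieval from Python with mysql.connector, a minimal sketch (reusing the connection settings from above) might look like this:

import mysql.connector

conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password="yourpassword",
    database="gmarket_db"
)
cursor = conn.cursor()

# Fetch all products, highest price first
cursor.execute("SELECT title, price FROM products ORDER BY price DESC")
for title, price in cursor.fetchall():
    print(title, price)
conn.close()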

2. Scraping Gmarket and Storing Data in MongoDB

from selenium import webdriver
from selenium.webdriver.common.by import By
from pymongo import MongoClient

# Set up Selenium WebDriver
driver = webdriver.Chrome()
driver.get("https://www.gmarket.co.kr/")
driver.implicitly_wait(10)  # allow JavaScript-rendered content to load

# Scrape product data
products = driver.find_elements(By.CSS_SELECTOR, ".box__item-title")
prices = driver.find_elements(By.CSS_SELECTOR, ".box__price-seller strong")

data = []
for product, price in zip(products[:10], prices[:10]):
    data.append({
        "title": product.text,
        "price": int(price.text.replace(",", "").replace("₩", ""))
    })

driver.quit()

# Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")
db = client["gmarket_db"]
collection = db["products"]

# Insert data into MongoDB
collection.insert_many(data)
print("Data inserted successfully!")

MongoDB Shell Query to Retrieve Data

db.products.find().sort({price: -1}).pretty();
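
The query above uses mongo shell syntax. From Python, a PyMongo equivalent is sketched below; note that PyMongo's sort() takes a field name and direction rather than a document:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["gmarket_db"]["products"]

# Fetch all products, highest price first
for doc in collection.find().sort("price", -1):
    print(doc["title"], doc["price"])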

3. Scraping Gmarket and Storing Data in PostgreSQL

from selenium import webdriver
from selenium.webdriver.common.by import By
import psycopg2

# Set up Selenium WebDriver
driver = webdriver.Chrome()
driver.get("https://www.gmarket.co.kr/")
driver.implicitly_wait(10)  # allow JavaScript-rendered content to load

# Scrape product data
products = driver.find_elements(By.CSS_SELECTOR, ".box__item-title")
prices = driver.find_elements(By.CSS_SELECTOR, ".box__price-seller strong")

data = []
for product, price in zip(products[:10], prices[:10]):
    data.append((product.text, int(price.text.replace(",", "").replace("₩", ""))))

driver.quit()

# Connect to PostgreSQL
conn = psycopg2.connect(
    dbname="gmarket_db",
    user="postgres",
    password="yourpassword",
    host="localhost"
)
cursor = conn.cursor()

# Create table if not exists
cursor.execute("""
    CREATE TABLE IF NOT EXISTS products (
        id SERIAL PRIMARY KEY,
        title TEXT,
        price INTEGER
    )
""")

# Insert data into PostgreSQL
cursor.executemany("INSERT INTO products (title, price) VALUES (%s, %s)", data)
conn.commit()
conn.close()

PostgreSQL Query to Retrieve Data

SELECT * FROM products ORDER BY price DESC;

Summary of Techniques Used

  • Selenium for dynamic web scraping (due to JavaScript-rendered content on Gmarket).
  • Structured Data Storage in MySQL and PostgreSQL (Relational DBs).
  • NoSQL Storage in MongoDB (Document-based for flexible data storage).
