Go Web Scraper Architecture

This document outlines the architecture and design principles of the Go Web Scraper project, explaining how the different components interact to create a robust web scraping solution.

High-Level Architecture

The Go Web Scraper follows a modular architecture that separates concerns and promotes maintainability. The system is divided into the components described below.

Core Components

1. Command-Line Interface (CLI)

Located in cmd/scraper/main.go, the CLI provides a user-friendly interface for:

  • Configuring scraping parameters
  • Initiating scraping jobs
  • Exporting data to various formats
  • Starting the web server

The CLI uses the Go standard library flag package for argument parsing and configuration.
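
The exact flags are defined in cmd/scraper/main.go; the following is only a minimal sketch of flag-based configuration. The flag names (-scraper, -format, -limit, -serve) are illustrative, not necessarily the project's actual flags.

package main

import (
    "flag"
    "fmt"
)

func main() {
    // Hypothetical flags; the real CLI defines its own set in cmd/scraper/main.go.
    scraperName := flag.String("scraper", "hackernews", "which scraper to run")
    format := flag.String("format", "json", "export format, e.g. json or csv")
    limit := flag.Int("limit", 50, "maximum number of items to scrape")
    serve := flag.Bool("serve", false, "start the web server instead of scraping")
    flag.Parse()

    fmt.Println(*scraperName, *format, *limit, *serve)
}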

2. Scraper Modules

The scrapers/ directory contains individual scraper implementations:

  • Base Scraper Interface (scrapers/scraper.go): Defines the interface that all scrapers must implement
  • Hacker News Scraper (scrapers/hackernews.go): Extracts articles from Hacker News
  • Bookstore Scraper (scrapers/bookstore.go): Extracts products from a bookstore website

Each scraper implements the following interface:

type Scraper interface {
    Initialize() error
    Scrape(options ScrapeOptions) ([]interface{}, error)
    Name() string
}
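
A concrete scraper such as scrapers/hackernews.go then satisfies this interface. The outline below is illustrative only; field names such as a Limit option on ScrapeOptions are assumptions, not the project's actual definitions.

// Hypothetical implementation outline; field names are assumptions.
type HackerNewsScraper struct {
    baseURL string
}

func (s *HackerNewsScraper) Initialize() error {
    s.baseURL = "https://news.ycombinator.com"
    return nil
}

func (s *HackerNewsScraper) Scrape(options ScrapeOptions) ([]interface{}, error) {
    // Fetch and parse pages here, up to the configured item limit (assumed option).
    results := make([]interface{}, 0)
    return results, nil
}

func (s *HackerNewsScraper) Name() string {
    return "hackernews"
}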

3. Data Models

The models/ directory contains data structures for:

  • Article (models/article.go): Represents articles from news sites
  • Product (models/product.go): Represents products from e-commerce sites
  • Scraper (models/scraper.go): Represents scraper configuration and metadata

These models are used across the application for data persistence and API responses.
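
As an illustration, the Article model might look roughly like the struct below. The fields are inferred from the articles table documented later and may not match models/article.go exactly; the Date field assumes the standard library time package.

// Illustrative shape only; see models/article.go for the real definition.
type Article struct {
    ID       int       `json:"id"`
    Title    string    `json:"title"`
    URL      string    `json:"url"`
    Author   string    `json:"author"`
    Points   int       `json:"points"`
    Comments int       `json:"comments"`
    Date     time.Time `json:"date"`
    Source   string    `json:"source"`
    Content  string    `json:"content"`
}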

4. Database Layer

Located in db/ directory, the database layer handles:

  • Database connection and initialization
  • Schema creation and migrations
  • CRUD operations for scraped data
  • Query building and execution

The application uses SQLite for persistent storage, providing a lightweight yet powerful database solution.
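
A minimal sketch of how the database layer might open the database file and ensure the articles table exists is shown below. It assumes the mattn/go-sqlite3 driver and a function named Open; the actual code in db/ may use a different driver, function names, or schema details.

package db

import (
    "database/sql"

    _ "github.com/mattn/go-sqlite3" // assumed driver; the project may use another
)

// Open opens (or creates) the SQLite file and ensures the articles table exists.
// This mirrors the schema documented below, but it is a sketch, not the project's code.
func Open(path string) (*sql.DB, error) {
    conn, err := sql.Open("sqlite3", path)
    if err != nil {
        return nil, err
    }
    _, err = conn.Exec(`CREATE TABLE IF NOT EXISTS articles (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT,
        url TEXT,
        author TEXT,
        points INTEGER,
        comments INTEGER,
        date TIMESTAMP,
        source TEXT,
        content TEXT
    )`)
    if err != nil {
        conn.Close()
        return nil, err
    }
    return conn, nil
}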

5. Web Interface

The web interface is contained in the web/ directory and consists of:

  • HTTP Server (web/server.go): Handles HTTP requests and response rendering
  • Templates (web/templates/): HTML templates using Go’s html/template package
  • Static Assets (web/static/): CSS, JavaScript, and images
  • API Endpoints (web/api.go): RESTful API for accessing scraped data

The web interface uses the standard Go net/http package without external frameworks to minimize dependencies.
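
A sketch of how the server might wire up templates, static assets, and API routes with net/http is shown below. The function and route names are illustrative; see web/server.go and web/api.go for the actual handlers.

package web

import (
    "html/template"
    "net/http"
)

// Start registers example routes and blocks serving HTTP. Route names are
// illustrative, not the project's actual routing table.
func Start(addr string) error {
    mux := http.NewServeMux()

    // Static assets from web/static/.
    mux.Handle("/static/", http.StripPrefix("/static/", http.FileServer(http.Dir("web/static"))))

    // HTML pages rendered from web/templates/.
    tmpl := template.Must(template.ParseGlob("web/templates/*.html"))
    mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        tmpl.ExecuteTemplate(w, "index.html", nil)
    })

    // JSON API endpoints (see API Design below).
    mux.HandleFunc("/api/articles", func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "application/json")
        w.Write([]byte(`[]`))
    })

    return http.ListenAndServe(addr, mux)
}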

Data Flow

  1. User Input: Either through CLI flags or web interface forms
  2. Scraper Selection: The appropriate scraper is selected based on the target website
  3. Data Extraction: The scraper navigates the website and extracts structured data
  4. Processing: Raw data is processed and transformed into the appropriate models
  5. Storage: Data is stored in the SQLite database
  6. Presentation: Data is either exported to files or presented via the web interface
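
End to end, the steps above might be wired together roughly as follows. Only the Scraper interface comes from the project; the function name and the store callback are illustrative.

// Hypothetical orchestration of a single scraping job.
func runJob(s Scraper, opts ScrapeOptions, store func(items []interface{}) error) error {
    if err := s.Initialize(); err != nil { // steps 1-2: scraper selected and initialized
        return err
    }
    items, err := s.Scrape(opts) // steps 3-4: extraction and transformation into models
    if err != nil {
        return err
    }
    return store(items) // step 5: persistence; presentation (step 6) happens via export or the web UI
}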

Concurrency Model

The application leverages Go’s concurrency primitives:

  • Goroutines: Used for parallel scraping and processing
  • Channels: Used for communication between scraping workers
  • Sync Package: Used for coordination and synchronization

The concurrency model is implemented in utils/collector.go, which provides the following (a simplified sketch appears after the list):

  • Rate limiting to respect website policies
  • Work distribution across multiple goroutines
  • Error handling and propagation
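
A simplified version of such a collector, assuming a worker-pool design with a shared ticker for rate limiting, might look like this. It is a sketch of the pattern, not the actual utils/collector.go implementation.

package utils

import (
    "sync"
    "time"
)

// Collect fans URLs out to a fixed number of workers, pacing requests with a
// shared ticker so the target site is not hammered, and gathers any errors.
func Collect(urls []string, workers int, delay time.Duration, fetch func(string) error) []error {
    jobs := make(chan string)
    errs := make(chan error, len(urls))
    limiter := time.NewTicker(delay) // rate limiting: at most one request per delay interval
    defer limiter.Stop()

    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for url := range jobs {
                <-limiter.C // wait for the next slot before fetching
                if err := fetch(url); err != nil {
                    errs <- err // error propagation back to the caller
                }
            }
        }()
    }

    for _, u := range urls {
        jobs <- u
    }
    close(jobs)
    wg.Wait()
    close(errs)

    var out []error
    for err := range errs {
        out = append(out, err)
    }
    return out
}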

Database Schema

The SQLite database schema includes the following tables:

Articles Table

+-----------------+-------------+
| Column          | Type        |
+-----------------+-------------+
| id              | INTEGER     |
| title           | TEXT        |
| url             | TEXT        |
| author          | TEXT        |
| points          | INTEGER     |
| comments        | INTEGER     |
| date            | TIMESTAMP   |
| source          | TEXT        |
| content         | TEXT        |
+-----------------+-------------+

Products Table

+-----------------+-------------+
| Column          | Type        |
+-----------------+-------------+
| id              | INTEGER     |
| title           | TEXT        |
| price           | REAL        |
| description     | TEXT        |
| image_url       | TEXT        |
| rating          | REAL        |
| category        | TEXT        |
| source          | TEXT        |
+-----------------+-------------+

Categories Table

The categories table stores an id, a name, and a parent_id that allows categories to be nested. A separate join table links products to categories through product_id and category_id pairs.

API Design

The REST API follows these principles, illustrated by the handler sketch after the list:

  • Resource-oriented routes (e.g., /api/articles, /api/products)
  • Standard HTTP methods (GET, POST, PUT, DELETE)
  • JSON responses with consistent structure
  • Pagination for list endpoints
  • Comprehensive error handling
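
A paginated list endpoint consistent with these principles could look roughly like the handler below. The response envelope and the page query parameter are assumptions, not the documented API contract.

import (
    "encoding/json"
    "net/http"
    "strconv"
)

// listArticles is a hypothetical paginated handler; parameter and field names
// are illustrative only.
func listArticles(w http.ResponseWriter, r *http.Request) {
    page, _ := strconv.Atoi(r.URL.Query().Get("page")) // pagination via ?page=N (assumed)
    if page < 1 {
        page = 1
    }

    resp := map[string]interface{}{
        "data":  []interface{}{}, // the articles for this page would go here
        "page":  page,
        "error": nil,
    }
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(resp)
}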

Authentication & Security

The application currently does not implement authentication as it’s designed for personal use. However, it includes:

  • Input validation to prevent SQL injection
  • CORS headers for web security
  • Rate limiting to prevent abuse
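
For example, CORS headers could be applied as middleware around the API handlers. This is a generic net/http sketch, not the project's exact middleware or header values.

// Generic CORS middleware sketch; the actual allowed origins and methods may differ.
func withCORS(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Access-Control-Allow-Origin", "*")
        w.Header().Set("Access-Control-Allow-Methods", "GET, POST, PUT, DELETE")
        if r.Method == http.MethodOptions {
            w.WriteHeader(http.StatusNoContent) // answer preflight requests directly
            return
        }
        next.ServeHTTP(w, r)
    })
}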

Ethical Scraping Measures

The application implements several measures to ensure ethical web scraping:

  • Configurable delays between requests
  • User-agent rotation across requests
  • Respect for robots.txt directives
  • Option to limit the depth and breadth of scraping
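
The delay and user-agent rotation could be combined in the request path roughly as below. The helper name, the user-agent strings, and the rotation scheme are illustrative; in the project these would come from configuration.

// Illustrative polite-request helper; delay and user agents are configuration, not fixed values.
var userAgents = []string{
    "Mozilla/5.0 (X11; Linux x86_64)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

func politeGet(client *http.Client, url string, delay time.Duration, attempt int) (*http.Response, error) {
    time.Sleep(delay) // configurable pause between requests
    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("User-Agent", userAgents[attempt%len(userAgents)]) // rotate user agents
    return client.Do(req)
}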

Testing Strategy

The project includes various types of tests:

  • Unit Tests: Testing individual functions and methods
  • Integration Tests: Testing interactions between components
  • End-to-End Tests: Testing complete user workflows
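
As a trivial example, a unit test using the standard testing package might check a scraper's Name method (assuming a scraper type like the HackerNewsScraper sketched earlier; real tests exercise parsing and database code).

func TestHackerNewsScraperName(t *testing.T) {
    s := &HackerNewsScraper{}
    if got := s.Name(); got != "hackernews" {
        t.Errorf("Name() = %q, want %q", got, "hackernews")
    }
}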

Future Architecture Considerations

Planned architectural improvements include:

  • Switching to a plugin architecture for scrapers
  • Adding support for distributed scraping
  • Implementing a more robust job queue
  • Adding support for additional database backends