Go Web Scraper Architecture

This document outlines the architecture and design principles of the Go Web Scraper project, explaining how the different components interact to create a robust web scraping solution.

High-Level Architecture

The Go Web Scraper follows a modular architecture that separates concerns and promotes maintainability. The system is divided into the components described below.

Core Components

1. Command-Line Interface (CLI)

Located in cmd/scraper/main.go, the CLI provides a user-friendly interface for:

  • Configuring scraping parameters
  • Initiating scraping jobs
  • Exporting data to various formats
  • Starting the web server

The CLI uses the Go standard library flag package for argument parsing and configuration.
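
The exact flags are defined in cmd/scraper/main.go; the following is only a minimal sketch of flag-based configuration. The flag names (-scraper, -format, -limit, -serve) are illustrative, not necessarily the project's actual flags.

package main

import (
    "flag"
    "fmt"
)

func main() {
    // Hypothetical flags; the real CLI defines its own set in cmd/scraper/main.go.
    scraperName := flag.String("scraper", "hackernews", "which scraper to run")
    format := flag.String("format", "json", "export format, e.g. json or csv")
    limit := flag.Int("limit", 50, "maximum number of items to scrape")
    serve := flag.Bool("serve", false, "start the web server instead of scraping")
    flag.Parse()

    fmt.Println(*scraperName, *format, *limit, *serve)
}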

2. Scraper Modules

The scrapers/ directory contains individual scraper implementations:

  • Base Scraper Interface (scrapers/scraper.go): Defines the interface that all scrapers must implement
  • Hacker News Scraper (scrapers/hackernews.go): Extracts articles from Hacker News
  • Bookstore Scraper (scrapers/bookstore.go): Extracts products from a bookstore website

Each scraper implements the following interface:

type Scraper interface {
    Initialize() error
    Scrape(options ScrapeOptions) ([]interface{}, error)
    Name() string
}
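
A concrete scraper such as scrapers/hackernews.go then satisfies this interface. The outline below is illustrative only; field names such as a Limit option on ScrapeOptions are assumptions, not the project's actual definitions.

// Hypothetical implementation outline; field names are assumptions.
type HackerNewsScraper struct {
    baseURL string
}

func (s *HackerNewsScraper) Initialize() error {
    s.baseURL = "https://news.ycombinator.com"
    return nil
}

func (s *HackerNewsScraper) Scrape(options ScrapeOptions) ([]interface{}, error) {
    // Fetch and parse pages here, up to the configured item limit (assumed option).
    results := make([]interface{}, 0)
    return results, nil
}

func (s *HackerNewsScraper) Name() string {
    return "hackernews"
}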

3. Data Models

The models/ directory contains data structures for:

  • Article (models/article.go): Represents articles from news sites
  • Product (models/product.go): Represents products from e-commerce sites
  • Scraper (models/scraper.go): Represents scraper configuration and metadata

These models are used across the application for data persistence and API responses.
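
As an illustration, the Article model might look roughly like the struct below. The fields are inferred from the articles table documented later and may not match models/article.go exactly; the Date field assumes the standard library time package.

// Illustrative shape only; see models/article.go for the real definition.
type Article struct {
    ID       int       `json:"id"`
    Title    string    `json:"title"`
    URL      string    `json:"url"`
    Author   string    `json:"author"`
    Points   int       `json:"points"`
    Comments int       `json:"comments"`
    Date     time.Time `json:"date"`
    Source   string    `json:"source"`
    Content  string    `json:"content"`
}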

4. Database Layer

Located in db/ directory, the database layer handles:

  • Database connection and initialization
  • Schema creation and migrations
  • CRUD operations for scraped data
  • Query building and execution

The application uses SQLite for persistent storage, providing a lightweight yet powerful database solution.
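
A minimal sketch of how the database layer might open the database file and ensure the articles table exists is shown below. It assumes the mattn/go-sqlite3 driver and a function named Open; the actual code in db/ may use a different driver, function names, or schema details.

package db

import (
    "database/sql"

    _ "github.com/mattn/go-sqlite3" // assumed driver; the project may use another
)

// Open opens (or creates) the SQLite file and ensures the articles table exists.
// This mirrors the schema documented below, but it is a sketch, not the project's code.
func Open(path string) (*sql.DB, error) {
    conn, err := sql.Open("sqlite3", path)
    if err != nil {
        return nil, err
    }
    _, err = conn.Exec(`CREATE TABLE IF NOT EXISTS articles (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT,
        url TEXT,
        author TEXT,
        points INTEGER,
        comments INTEGER,
        date TIMESTAMP,
        source TEXT,
        content TEXT
    )`)
    if err != nil {
        conn.Close()
        return nil, err
    }
    return conn, nil
}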

5. Web Interface

The web interface is contained in the web/ directory and consists of:

  • HTTP Server (web/server.go): Handles HTTP requests and response rendering
  • Templates (web/templates/): HTML templates using Go’s html/template package
  • Static Assets (web/static/): CSS, JavaScript, and images
  • API Endpoints (web/api.go): RESTful API for accessing scraped data

The web interface uses the standard Go net/http package without external frameworks to minimize dependencies.
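
A sketch of how the server might wire up templates, static assets, and API routes with net/http is shown below. The function and route names are illustrative; see web/server.go and web/api.go for the actual handlers.

package web

import (
    "html/template"
    "net/http"
)

// Start registers example routes and blocks serving HTTP. Route names are
// illustrative, not the project's actual routing table.
func Start(addr string) error {
    mux := http.NewServeMux()

    // Static assets from web/static/.
    mux.Handle("/static/", http.StripPrefix("/static/", http.FileServer(http.Dir("web/static"))))

    // HTML pages rendered from web/templates/.
    tmpl := template.Must(template.ParseGlob("web/templates/*.html"))
    mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        tmpl.ExecuteTemplate(w, "index.html", nil)
    })

    // JSON API endpoints (see API Design below).
    mux.HandleFunc("/api/articles", func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "application/json")
        w.Write([]byte(`[]`))
    })

    return http.ListenAndServe(addr, mux)
}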

Data Flow

  1. User Input: Either through CLI flags or web interface forms
  2. Scraper Selection: The appropriate scraper is selected based on the target website
  3. Data Extraction: The scraper navigates the website and extracts structured data
  4. Processing: Raw data is processed and transformed into the appropriate models
  5. Storage: Data is stored in the SQLite database
  6. Presentation: Data is either exported to files or presented via the web interface
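
End to end, the steps above might be wired together roughly as follows. Only the Scraper interface comes from the project; the function name and the store callback are illustrative.

// Hypothetical orchestration of a single scraping job.
func runJob(s Scraper, opts ScrapeOptions, store func(items []interface{}) error) error {
    if err := s.Initialize(); err != nil { // steps 1-2: scraper selected and initialized
        return err
    }
    items, err := s.Scrape(opts) // steps 3-4: extraction and transformation into models
    if err != nil {
        return err
    }
    return store(items) // step 5: persistence; presentation (step 6) happens via export or the web UI
}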

Concurrency Model

The application leverages Go’s concurrency primitives:

  • Goroutines: Used for parallel scraping and processing
  • Channels: Used for communication between scraping workers
  • Sync Package: Used for coordination and synchronization

The concurrency model is implemented in utils/collector.go, which provides the following (a simplified sketch appears after the list):

  • Rate limiting to respect website policies
  • Work distribution across multiple goroutines
  • Error handling and propagation
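
A simplified version of such a collector, assuming a worker-pool design with a shared ticker for rate limiting, might look like this. It is a sketch of the pattern, not the actual utils/collector.go implementation.

package utils

import (
    "sync"
    "time"
)

// Collect fans URLs out to a fixed number of workers, pacing requests with a
// shared ticker so the target site is not hammered, and gathers any errors.
func Collect(urls []string, workers int, delay time.Duration, fetch func(string) error) []error {
    jobs := make(chan string)
    errs := make(chan error, len(urls))
    limiter := time.NewTicker(delay) // rate limiting: at most one request per delay interval
    defer limiter.Stop()

    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for url := range jobs {
                <-limiter.C // wait for the next slot before fetching
                if err := fetch(url); err != nil {
                    errs <- err // error propagation back to the caller
                }
            }
        }()
    }

    for _, u := range urls {
        jobs <- u
    }
    close(jobs)
    wg.Wait()
    close(errs)

    var out []error
    for err := range errs {
        out = append(out, err)
    }
    return out
}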

Database Schema

The SQLite database schema includes the following tables:

Articles Table

+-----------------+-------------+
| Column          | Type        |
+-----------------+-------------+
| id              | INTEGER     |
| title           | TEXT        |
| url             | TEXT        |
| author          | TEXT        |
| points          | INTEGER     |
| comments        | INTEGER     |
| date            | TIMESTAMP   |
| source          | TEXT        |
| content         | TEXT        |
+-----------------+-------------+

Products Table

+-----------------+-------------+
| Column          | Type        |
+-----------------+-------------+
| id              | INTEGER     |
| title           | TEXT        |
| price           | REAL        |
| description     | TEXT        |
| image_url       | TEXT        |
| rating          | REAL        |
| category        | TEXT        |
| source          | TEXT        |
+-----------------+-------------+

Categories Table

The categories table stores an id, a name, and a parent_id that allows categories to be nested. A separate join table links products to categories through product_id and category_id pairs.

API Design

The REST API follows these principles, illustrated by the handler sketch after the list:

  • Resource-oriented routes (e.g., /api/articles, /api/products)
  • Standard HTTP methods (GET, POST, PUT, DELETE)
  • JSON responses with consistent structure
  • Pagination for list endpoints
  • Comprehensive error handling
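
A paginated list endpoint consistent with these principles could look roughly like the handler below. The response envelope and the page query parameter are assumptions, not the documented API contract.

import (
    "encoding/json"
    "net/http"
    "strconv"
)

// listArticles is a hypothetical paginated handler; parameter and field names
// are illustrative only.
func listArticles(w http.ResponseWriter, r *http.Request) {
    page, _ := strconv.Atoi(r.URL.Query().Get("page")) // pagination via ?page=N (assumed)
    if page < 1 {
        page = 1
    }

    resp := map[string]interface{}{
        "data":  []interface{}{}, // the articles for this page would go here
        "page":  page,
        "error": nil,
    }
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(resp)
}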

Authentication & Security

The application currently does not implement authentication as it’s designed for personal use. However, it includes:

  • Input validation to prevent SQL injection
  • CORS headers for web security
  • Rate limiting to prevent abuse
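
For example, CORS headers could be applied as middleware around the API handlers. This is a generic net/http sketch, not the project's exact middleware or header values.

// Generic CORS middleware sketch; the actual allowed origins and methods may differ.
func withCORS(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Access-Control-Allow-Origin", "*")
        w.Header().Set("Access-Control-Allow-Methods", "GET, POST, PUT, DELETE")
        if r.Method == http.MethodOptions {
            w.WriteHeader(http.StatusNoContent) // answer preflight requests directly
            return
        }
        next.ServeHTTP(w, r)
    })
}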

Ethical Scraping Measures

The application implements several measures to ensure ethical web scraping:

  • Configurable delays between requests
  • User-agent rotation across requests
  • Respect for robots.txt directives
  • Option to limit the depth and breadth of scraping
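
The delay and user-agent rotation could be combined in the request path roughly as below. The helper name, the user-agent strings, and the rotation scheme are illustrative; in the project these would come from configuration.

// Illustrative polite-request helper; delay and user agents are configuration, not fixed values.
var userAgents = []string{
    "Mozilla/5.0 (X11; Linux x86_64)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

func politeGet(client *http.Client, url string, delay time.Duration, attempt int) (*http.Response, error) {
    time.Sleep(delay) // configurable pause between requests
    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("User-Agent", userAgents[attempt%len(userAgents)]) // rotate user agents
    return client.Do(req)
}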

Testing Strategy

The project includes various types of tests:

  • Unit Tests: Testing individual functions and methods
  • Integration Tests: Testing interactions between components
  • End-to-End Tests: Testing complete user workflows
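
As a trivial example, a unit test using the standard testing package might check a scraper's Name method (assuming a scraper type like the HackerNewsScraper sketched earlier; real tests exercise parsing and database code).

func TestHackerNewsScraperName(t *testing.T) {
    s := &HackerNewsScraper{}
    if got := s.Name(); got != "hackernews" {
        t.Errorf("Name() = %q, want %q", got, "hackernews")
    }
}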

Future Architecture Considerations

Planned architectural improvements include:

  • Switching to a plugin architecture for scrapers
  • Adding support for distributed scraping
  • Implementing a more robust job queue
  • Adding support for additional database backends