Scraper Architecture

This document provides a detailed overview of the Scraper service's internal architecture and its role within the Labeeb platform. Understanding this architecture is critical for effective troubleshooting and for extending the service.


1. Platform Service Responsibilities

System-Wide Context

The Labeeb platform is a distributed system, so a failure in one service can manifest as a symptom in another. The matrix below defines clear ownership and responsibilities for each service and is the foundation of our incident response process.

| Service  | Tech           | Core Responsibility                                      | Inputs                             | Outputs                  | Depends On               |
|----------|----------------|----------------------------------------------------------|------------------------------------|--------------------------|--------------------------|
| API      | Laravel/PHP    | Central gateway, orchestrates jobs, owns PG & OS writes. | Ingest batches, client requests.   | API responses, jobs.     | PG, Redis, OS, AI-Box.   |
| AI-Box   | Python/FastAPI | Hosts AI models (search, NER, etc.).                     | API jobs/requests.                 | Analysis results (JSON). | OS, API (for hydration). |
| Scraper  | Python/FastAPI | Fetches & normalizes articles from external sources.     | Profiles, external websites.       | Ingest batches.          | API (for ingestion).     |
| Search   | OpenSearch     | Provides search capabilities.                            | Indexing requests, search queries. | Search results.          | (None)                   |
| Frontend | Next.js        | User interface.                                          | User actions.                      | Web pages.               | API.                     |

2. Internal Architecture & Data Flow

The service's source tree is organized as follows:

/scraper/app/
├── core/                # Core application logic & configuration
├── data/                # Data models, normalization, and schemas
├── scraping/            # The business logic of scraping
│   └── providers/       # All specific provider implementations
└── services/            # Clients for external services & state

Architectural Principles

The Scraper's architecture is designed for extensibility and operational safety. The key design decisions are:

  • Decoupling Configuration from Code: Scraping logic (Python Provider classes) is kept entirely separate from scraping targets (JSON Profile files). This allows operators to add or change targets without deploying new code (see the profile and provider sketch after this list).
  • Provider-Based Strategy: Each external source type has a dedicated Provider class. This isolates the logic for handling different website layouts (e.g., RSS vs. HTML) and makes adding new sources predictable.
  • Stateful Operation: The service maintains a simple state file (state.json) that tracks the last-seen article for each profile. This prevents reprocessing of already-seen articles and provides a clear audit trail (see the state-file sketch below).
  • Dual-Mode Triggers: The service can run on a time-based schedule (APScheduler) for routine collection or be triggered on demand via a REST API for manual overrides and testing (see the trigger sketch below).
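
For illustration, here is a minimal sketch of how a Profile file and a Provider factory might fit together. The profile fields, class names, and the get_provider helper are assumptions made for this example, not the service's actual schema.

# Hypothetical profile, as it might appear in profiles/example_news.json
# (field names are illustrative; the real schema may differ).
EXAMPLE_PROFILE = {
    "name": "example_news",
    "provider": "rss",
    "url": "https://example.com/feed.xml",
    "enabled": True,
}

class Provider:
    """Base class: one subclass per external source type."""
    def __init__(self, profile: dict):
        self.profile = profile

    def fetch(self) -> list[dict]:
        """Return normalized article dicts for this profile."""
        raise NotImplementedError

class RssProvider(Provider):
    def fetch(self) -> list[dict]:
        # Fetch and normalize articles from an RSS feed (details omitted).
        return []

class HtmlProvider(Provider):
    def fetch(self) -> list[dict]:
        # Fetch and normalize articles from an HTML listing page (details omitted).
        return []

# Minimal factory: the profile's "provider" key selects the implementation,
# so adding a new source type means registering one new class here.
PROVIDERS = {"rss": RssProvider, "html": HtmlProvider}

def get_provider(profile: dict) -> Provider:
    return PROVIDERS[profile["provider"]](profile)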
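
Likewise, a sketch of the state-file handling, assuming a flat JSON map from profile name to the last-seen article identifier; the file location and helper names are illustrative.

import json
from pathlib import Path

STATE_PATH = Path("state.json")  # assumed location; the real path may differ

def load_state() -> dict:
    """Return the last-seen marker per profile, e.g. {"example_news": "<article-url>"}."""
    if STATE_PATH.exists():
        return json.loads(STATE_PATH.read_text())
    return {}

def save_state(state: dict) -> None:
    STATE_PATH.write_text(json.dumps(state, indent=2))

def is_new(state: dict, profile_name: str, article_id: str) -> bool:
    """True unless this article is already recorded as the last-seen item."""
    return state.get(profile_name) != article_id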
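
And a sketch of the dual-mode triggers: a scheduled run via APScheduler plus an on-demand FastAPI endpoint. The /scrape path matches the diagram below; the function names, interval, and query parameter are assumptions.

from apscheduler.schedulers.background import BackgroundScheduler
from fastapi import FastAPI

app = FastAPI()
scheduler = BackgroundScheduler()

def run_all_profiles() -> None:
    # Load every enabled profile and run its provider (details omitted).
    ...

# Routine collection: run all profiles on a fixed interval.
scheduler.add_job(run_all_profiles, "interval", minutes=30)
scheduler.start()

# Manual override / testing: trigger a single profile on demand.
@app.post("/scrape")
def scrape(profile: str) -> dict:
    # Run just the requested profile and return a job summary.
    return {"profile": profile, "status": "queued"}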

Data Flow Diagram (DFD)

This diagram illustrates the primary data flow for an on-demand scrape triggered via the API.

flowchart TD
    subgraph "User / Operator"
        U[Client]:::ext
    end

    subgraph "Scraper Service"
        S[FastAPI Server]:::svc
        P[Profile Loader]:::svc
        E[Scraper Engine]:::svc
        R[Provider Factory]:::svc
        F[Profiles/*.json]:::store
    end

    subgraph "Downstream"
        API[(Labeeb API)]:::ext
        W[(data/out/*.jsonl)]:::store
    end

    U -- "POST /scrape" --> S
    S --> P --> F
    S -- "Builds Job" --> E
    E --> R
    P -- "Provides Profile" --> R
    R -- "Selects & Runs" --> Provider(Provider Instance)
    Provider -- "Fetches Articles" --> A(Normalized Articles)
    A --> E
    E -- "Returns Response" --> S
    S -- "Optionally Sends" --> API
    S -- "Optionally Writes" --> W
    S -- "Job Response" --> U