Scraper Architecture

This document provides a detailed overview of the Scraper service's internal architecture and its role within the Labeeb platform. Understanding this architecture is critical for effective troubleshooting and for extending the service.


1. Platform Service Responsibilities

System-Wide Context

The Labeeb platform is a distributed system, so a failure in one service can manifest as a symptom in another. The matrix below defines clear ownership and responsibilities for each service and is the foundation of our incident response process.

| Service  | Tech           | Core Responsibility                                      | Inputs                             | Outputs                  | Depends On               |
|----------|----------------|----------------------------------------------------------|------------------------------------|--------------------------|--------------------------|
| API      | Laravel/PHP    | Central gateway, orchestrates jobs, owns PG & OS writes. | Ingest batches, client requests.   | API responses, jobs.     | PG, Redis, OS, AI-Box.   |
| AI-Box   | Python/FastAPI | Hosts AI models (search, NER, etc.).                     | API jobs/requests.                 | Analysis results (JSON). | OS, API (for hydration). |
| Scraper  | Python/FastAPI | Fetches & normalizes articles from external sources.     | Profiles, external websites.       | Ingest batches.          | API (for ingestion).     |
| Search   | OpenSearch     | Provides search capabilities.                            | Indexing requests, search queries. | Search results.          | (None)                   |
| Frontend | Next.js        | User interface.                                          | User actions.                      | Web pages.               | API.                     |

2. Internal Architecture & Data Flow

The service's source tree is organized as follows:

/scraper/app/
├── core/                # Core application logic & configuration
├── data/                # Data models, normalization, and schemas
├── scraping/            # The business logic of scraping
│   └── providers/       # All specific provider implementations
└── services/            # Clients for external services & state

Architectural Principles

The Scraper's architecture is designed for extensibility and operational safety. The key design decisions are:

  • Decoupling Configuration from Code: Scraping logic (Python Provider classes) is kept entirely separate from scraping targets (JSON Profile files). This allows operators to add or change targets without deploying new code (see the profile and provider sketch after this list).
  • Provider-Based Strategy: Each external source type has a dedicated Provider class. This isolates the logic for handling different website layouts (e.g., RSS vs. HTML) and makes adding new sources predictable.
  • Stateful Operation: The service maintains a simple state file (state.json) that tracks the last-seen article for each profile. This prevents reprocessing of already-seen articles and provides a clear audit trail (see the state-file sketch below).
  • Dual-Mode Triggers: The service can run on a time-based schedule (APScheduler) for routine collection or be triggered on demand via a REST API for manual overrides and testing (see the trigger sketch below).
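
For illustration, here is a minimal sketch of how a Profile file and a Provider factory might fit together. The profile fields, class names, and the get_provider helper are assumptions made for this example, not the service's actual schema.

# Hypothetical profile, as it might appear in profiles/example_news.json
# (field names are illustrative; the real schema may differ).
EXAMPLE_PROFILE = {
    "name": "example_news",
    "provider": "rss",
    "url": "https://example.com/feed.xml",
    "enabled": True,
}

class Provider:
    """Base class: one subclass per external source type."""
    def __init__(self, profile: dict):
        self.profile = profile

    def fetch(self) -> list[dict]:
        """Return normalized article dicts for this profile."""
        raise NotImplementedError

class RssProvider(Provider):
    def fetch(self) -> list[dict]:
        # Fetch and normalize articles from an RSS feed (details omitted).
        return []

class HtmlProvider(Provider):
    def fetch(self) -> list[dict]:
        # Fetch and normalize articles from an HTML listing page (details omitted).
        return []

# Minimal factory: the profile's "provider" key selects the implementation,
# so adding a new source type means registering one new class here.
PROVIDERS = {"rss": RssProvider, "html": HtmlProvider}

def get_provider(profile: dict) -> Provider:
    return PROVIDERS[profile["provider"]](profile)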
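
Likewise, a sketch of the state-file handling, assuming a flat JSON map from profile name to the last-seen article identifier; the file location and helper names are illustrative.

import json
from pathlib import Path

STATE_PATH = Path("state.json")  # assumed location; the real path may differ

def load_state() -> dict:
    """Return the last-seen marker per profile, e.g. {"example_news": "<article-url>"}."""
    if STATE_PATH.exists():
        return json.loads(STATE_PATH.read_text())
    return {}

def save_state(state: dict) -> None:
    STATE_PATH.write_text(json.dumps(state, indent=2))

def is_new(state: dict, profile_name: str, article_id: str) -> bool:
    """True unless this article is already recorded as the last-seen item."""
    return state.get(profile_name) != article_id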
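
And a sketch of the dual-mode triggers: a scheduled run via APScheduler plus an on-demand FastAPI endpoint. The /scrape path matches the diagram below; the function names, interval, and query parameter are assumptions.

from apscheduler.schedulers.background import BackgroundScheduler
from fastapi import FastAPI

app = FastAPI()
scheduler = BackgroundScheduler()

def run_all_profiles() -> None:
    # Load every enabled profile and run its provider (details omitted).
    ...

# Routine collection: run all profiles on a fixed interval.
scheduler.add_job(run_all_profiles, "interval", minutes=30)
scheduler.start()

# Manual override / testing: trigger a single profile on demand.
@app.post("/scrape")
def scrape(profile: str) -> dict:
    # Run just the requested profile and return a job summary.
    return {"profile": profile, "status": "queued"}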

Data Flow Diagram (DFD)

This diagram illustrates the primary data flow for an on-demand scrape triggered via the API.

flowchart TD
    subgraph "User / Operator"
        U[Client]:::ext
    end

    subgraph "Scraper Service"
        S[FastAPI Server]:::svc
        P[Profile Loader]:::svc
        E[Scraper Engine]:::svc
        R[Provider Factory]:::svc
        F[Profiles/*.json]:::store
    end

    subgraph "Downstream"
        API[(Labeeb API)]:::ext
        W[(data/out/*.jsonl)]:::store
    end

    U -- "POST /scrape" --> S
    S --> P --> F
    S -- "Builds Job" --> E
    E --> R
    P -- "Provides Profile" --> R
    R -- "Selects & Runs" --> Provider(Provider Instance)
    Provider -- "Fetches Articles" --> A(Normalized Articles)
    A --> E
    E -- "Returns Response" --> S
    S -- "Optionally Sends" --> API
    S -- "Optionally Writes" --> W
    S -- "Job Response" --> U