Scraper Service: The Data Ingestion Playbook¶
Service Status: Operational
This document is the primary operational manual for the Scraper Service. It provides comprehensive, actionable guidance for on-call engineers to deploy, monitor, and troubleshoot this critical data ingestion component of the Labeeb platform. All procedures are designed for clarity, accuracy, and safe execution under pressure.
1. Mission & Scope¶
The Scraper's mission is to be the sole, reliable, and highly adaptable entry point for all external article data into the Labeeb platform.
It operates as a highly available, horizontally scalable microservice responsible for fetching, normalizing, and forwarding content from a diverse and ever-changing set of external sources. The service's architecture prioritizes configuration over code, enabling rapid adaptation to new sources and formats with minimal engineering overhead.
Scope of Responsibilities
- Is Responsible For:
    - Profile Management: Loading and validating a version-controlled library of JSON scraping profiles that define all data sources.
    - Job Execution: Executing scraping jobs on a recurring schedule (cron-based) and on-demand via its API.
    - Data Normalization: Transforming raw scraped data into the platform's canonical `Article` schema.
    - Upstream Forwarding: Batched forwarding of normalized data to the core API's ingestion endpoint.
    - State Persistence: Maintaining persistent state (last-seen article hashes) to ensure efficiency and minimize redundant work.
    - Rate Limit Compliance: Implementing backoff and retry logic to respect external source rate limits.
- Is NOT Responsible For:
    - Long-Term Storage: The scraper is ephemeral; its output is sent upstream and not stored locally long-term.
    - AI Analysis: Performing any AI-based analysis or enrichment. All intelligence is handled by the AI-Box service.
    - Direct Serving: Serving data directly to end-users or clients.
    - Data Deduplication: The scraper delegates deduplication to the upstream API's robust deduplication logic.
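For orientation, the snippet below shows the general shape of "normalized, batched output" sent upstream. It is illustrative only: the field names and the batch wrapper are assumptions for this example, not the authoritative canonical `Article` schema.

```bash
# Illustrative only: the field names and batch wrapper below are assumptions,
# not the authoritative canonical Article schema. This just shows the general
# shape of the normalized, batched output forwarded to the core API.
cat <<'EOF'
{
  "articles": [
    {
      "source": "example-source",
      "url": "https://example.com/news/article-123",
      "title": "Example headline",
      "published_at": "2024-01-01T00:00:00Z",
      "content": "Normalized article body text ..."
    }
  ]
}
EOF
```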
2. Service Responsibilities & Interactions¶
This table defines the Scraper's role and its critical dependencies within the Labeeb platform ecosystem.
| Service | Tech Stack | Core Responsibility | Inputs | Outputs | Depends On |
|---|---|---|---|---|---|
| Scraper | Python/FastAPI | Fetches and normalizes articles from external sources based on JSON profiles. | Cron schedules, API calls, External Websites | Batched `Article` JSON objects | API Service |
3. Guiding Principles¶
The architecture and operational philosophy of the Scraper service are deeply rooted in these core SRE principles:
- Configuration as Code:
    - What: All scraping logic, targets, and schedules are defined in version-controlled JSON profiles (e.g., `scraper/profiles/*.json`); an illustrative profile sketch follows this list.
    - Why: This ensures that all changes are auditable, reviewable via standard Git workflows, and can be rolled back safely.
    - How: The filesystem is the single source of truth for profile definitions, eliminating the need for a separate database of sources.
    - Benefit: Promotes consistency across environments and simplifies disaster recovery.
- Extensibility over Hardcoding:
    - What: The service employs a Provider plugin pattern, allowing new scrapers for specific websites (e.g., `AljazeeraProvider`) to be added as self-contained Python classes.
    - Why: This design minimizes changes to the core application logic when integrating new sources, promoting modularity and independent development.
    - How: New providers inherit from `BaseProvider` and are registered in a central registry, making them discoverable by the profile loader.
    - Benefit: Accelerates onboarding of new data sources and reduces the risk of regressions in existing scrapers.
- Stateful & Efficient:
    - What: The service tracks the last-seen article for each source category by persisting a URL hash to disk (`data/out/state.json`).
    - Why: This prevents reprocessing of already-seen content, minimizing redundant work, reducing network traffic, and respecting external source rate limits.
    - How: Providers query the state before fetching and update it after a successful run, ensuring only new content is processed.
    - Benefit: Optimizes resource utilization and maintains good citizenship with external websites.
- API-Driven Control:
    - What: While designed for automated operation via its internal scheduler, all core functions (scraping, profile management) are exposed via a REST API.
    - Why: This provides a clear, programmatic interface for precise, on-demand control, essential for testing, debugging, and operational interventions.
    - How: Endpoints like `/scrape` and `/profiles/reload` allow external systems or operators to trigger specific actions.
    - Benefit: Enhances operational flexibility and enables integration with external orchestration tools.
- Idempotency:
    - What: The scraper is designed to be safely re-run without adverse effects, even if a job fails midway.
    - Why: This ensures that retries or manual re-executions do not create duplicate data or cause unexpected side effects.
    - How: This is achieved through the state persistence mechanism (tracking last-seen articles) and reliance on the downstream API's robust deduplication capabilities.
    - Benefit: Simplifies recovery procedures and increases confidence in automated retries.
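To make the Configuration as Code principle concrete, here is a minimal sketch of adding a new profile. The file name and every field in the JSON body are assumptions (not the real profile schema), and the POST method on the reload endpoint is assumed; the port, the `/profiles/reload` path, and the `GenericRssProvider` name come from elsewhere in this document.

```bash
# Sketch only: every field in this JSON body is an assumption, not the real
# profile schema. Check the Environment & Configuration page for the
# authoritative schema before authoring a real profile.
cat > scraper/profiles/example-news.json <<'EOF'
{
  "name": "example-news",
  "provider": "GenericRssProvider",
  "url": "https://example.com/rss.xml",
  "schedule": "*/30 * * * *",
  "enabled": true
}
EOF

# Profiles are version-controlled: review the change via the normal Git workflow.
git add scraper/profiles/example-news.json
git commit -m "Add example-news scraping profile"

# Apply without a restart by reloading profiles from disk (see section 6.3;
# the POST method is an assumption -- confirm against the Swagger UI).
curl -X POST http://localhost:9001/profiles/reload
```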
4. Architecture at a Glance¶
The Scraper is a self-contained FastAPI application with a clear, decoupled architecture designed for reliability and extensibility. It orchestrates fetching content from external websites and pushing it to the core API.
```mermaid
flowchart TD
    subgraph Triggers
        A[":material-clock-outline: Scheduler (APScheduler)"]
        B[":material-api: API Call (On-Demand)"]
    end
    subgraph "Core Service"
        C[Scraper Engine]
        D[Profile Loader]
        E[Provider Registry]
    end
    subgraph Configuration
        F[":material-file-code: profiles/*.json"]
    end
    subgraph "Providers (Strategies)"
        G[GenericRssProvider]
        H[SiteSpecificProvider]
        I[...]
    end
    subgraph "External Systems"
        L([External Websites]):::ext
        API[(Core API Service)]:::ext
    end
    A --> C
    B --> C
    C --> D
    D --> F
    C --> E
    E --> G & H & I
    G & H & I --> L
    L --> G & H & I
    G & H & I -- Normalized Articles --> C
    C -- Batch Ingest --> API
```
5. Standard Deployment Process¶
This checklist outlines the standard procedure for deploying the Scraper service to a new environment or updating an existing deployment. This process ensures consistency and minimizes downtime.
- Prepare Environment Variables: Ensure all required environment variables are set in your deployment environment (e.g., `docker-compose.yml`, Kubernetes secrets, or a `.env` file). Refer to the Environment & Configuration page for a complete list.
- Build Docker Image: Build the Docker image for the Scraper service using the `Dockerfile` located in `scraper/docker/`.
- Push Docker Image (if applicable): If deploying to a remote environment, push the newly built image to your container registry.
- Deploy to Target Environment: Update your deployment configuration (e.g., `docker-compose.yml`, Kubernetes deployment) to use the new image tag and apply the changes.
- Verify Deployment: After deployment, perform a quick health check to ensure the service is running and responsive.
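The commands below sketch the checklist above for a Docker Compose deployment. The image name, tag, registry, Compose service name (`scraper`), and health endpoint path are assumptions; adapt them to your environment.

```bash
# Sketch for a Docker Compose deployment. Image name, registry, service name,
# and health path are assumptions -- adapt to your environment.

# Build the image from the Dockerfile in scraper/docker/ (repo root as context)
docker build -f scraper/docker/Dockerfile -t registry.example.com/labeeb/scraper:v1.2.3 .

# Push to your container registry (remote deployments only)
docker push registry.example.com/labeeb/scraper:v1.2.3

# Deploy: point docker-compose.yml at the new tag, then apply
docker compose up -d scraper

# Verify: container is up and the API responds
docker compose ps scraper
curl -fsS http://localhost:9001/health   # health path is an assumption; see section 6.1
```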
6. Routine Operations & Maintenance¶
This section covers common operational tasks for the Scraper service.
6.1. Checking Service Health¶
Verify the service is alive and responsive.
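A minimal liveness check, assuming the service listens on port 9001 (as in the examples elsewhere in this guide) and exposes a `/health` endpoint; confirm the exact path on the API Endpoints (Swagger UI) page.

```bash
# Assumes port 9001 and a /health endpoint; verify the path in the Swagger UI.
curl -fsS http://localhost:9001/health && echo "Scraper is responsive"
```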
6.2. Monitoring Logs¶
Tail the service logs for real-time activity and error monitoring.
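Assuming a containerized deployment where the Compose service is named `scraper` (the name is an assumption), recent activity can be followed with:

```bash
# Follow live log output; the service name "scraper" is an assumption.
docker compose logs -f --tail=100 scraper

# Quick scan of the recent window for errors
docker compose logs --tail=1000 scraper | grep -iE "error|traceback"
```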
6.3. Managing Profiles¶
- List Loaded Profiles: See which profiles are currently active.
- Reload Profiles from Disk: Apply changes to JSON profile files without restarting the service.
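Both operations are available over the REST API. The `/profiles/reload` path appears earlier in this guide; the `GET /profiles` listing endpoint and the POST method for the reload are assumptions, so confirm them on the API Endpoints page.

```bash
# List currently loaded profiles (GET /profiles is an assumption).
curl -s http://localhost:9001/profiles

# Reload profiles from disk after editing scraper/profiles/*.json
# (POST method is an assumption).
curl -X POST http://localhost:9001/profiles/reload
```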
6.4. Triggering On-Demand Scrapes¶
Manually initiate a scrape for specific sources.
```bash
# Scrape 'aljazeera' and 'fatabyyano' profiles, limit to 5 articles each
curl -X POST http://localhost:9001/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "sources": ["aljazeera", "fatabyyano"],
    "limit": 5,
    "send_to_api": true
  }'
```
7. Structured Incident Playbooks¶
This section provides direct links to detailed runbooks for common operational incidents affecting the Scraper service. These playbooks are designed to be followed step-by-step under pressure.
- Source Rate-Limiting (HTTP 429): Playbook for diagnosing and mitigating `HTTP 429 Too Many Requests` errors from external data sources.
- Scraper Backlog & Stuck Scheduler: Playbook for diagnosing and recovering from a stalled scraper or a growing data backlog.
- Scraping Profile Failures: Playbook for diagnosing and resolving failures in specific scraping profiles, such as those caused by selector drift or content changes.
8. Key Documentation¶
This overview is the entry point. For detailed operational procedures and technical specifications, use the following guides:
- API Endpoints: An interactive Swagger UI for exploring and testing the Scraper's REST API.
- Environment & Configuration: A complete reference for all environment variables and profile schema details.
- Service Dependencies: A categorized list of all production and development dependencies.