Scraper Service: The Data Ingestion Playbook¶
Service Status: Operational
This document is the primary operational manual for the Scraper Service. It provides comprehensive, actionable guidance for on-call engineers to deploy, monitor, and troubleshoot this critical data ingestion component of the Labeeb platform. All procedures are designed for clarity, accuracy, and safe execution under pressure.
1. Mission & Scope¶
The Scraper's mission is to be the sole, reliable, and highly adaptable entry point for all external article data into the Labeeb platform.
It operates as a highly available, horizontally scalable microservice responsible for fetching, normalizing, and forwarding content from a diverse and ever-changing set of external sources. The service's architecture prioritizes configuration over code, enabling rapid adaptation to new sources and formats with minimal engineering overhead.
Scope of Responsibilities
- Is Responsible For:
    - Profile Management: Loading and validating a version-controlled library of JSON scraping profiles that define all data sources.
    - Job Execution: Executing scraping jobs on a recurring schedule (cron-based) and on-demand via its API.
    - Data Normalization: Transforming raw scraped data into the platform's canonical `Article` schema.
    - Upstream Forwarding: Batched forwarding of normalized data to the core API's ingestion endpoint.
    - State Persistence: Maintaining persistent state (last-seen article hashes) to ensure efficiency and minimize redundant work.
    - Rate Limit Compliance: Implementing backoff and retry logic to respect external source rate limits.
- Is NOT Responsible For:
    - Long-Term Storage: The scraper is ephemeral; its output is sent upstream and not stored locally long-term.
    - AI Analysis: Performing any AI-based analysis or enrichment. All intelligence is handled by the AI-Box service.
    - Direct Serving: Serving data directly to end-users or clients.
    - Data Deduplication: The scraper delegates deduplication to the upstream API's robust deduplication logic.
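For orientation, the snippet below shows the general shape of "normalized, batched output" sent upstream. It is illustrative only: the field names and the batch wrapper are assumptions for this example, not the authoritative canonical `Article` schema.

```bash
# Illustrative only: the field names and batch wrapper below are assumptions,
# not the authoritative canonical Article schema. This just shows the general
# shape of the normalized, batched output forwarded to the core API.
cat <<'EOF'
{
  "articles": [
    {
      "source": "example-source",
      "url": "https://example.com/news/article-123",
      "title": "Example headline",
      "published_at": "2024-01-01T00:00:00Z",
      "content": "Normalized article body text ..."
    }
  ]
}
EOF
```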
2. Service Responsibilities & Interactions¶
This table defines the Scraper's role and its critical dependencies within the Labeeb platform ecosystem.
| Service | Tech Stack | Core Responsibility | Inputs | Outputs | Depends On |
|---|---|---|---|---|---|
| Scraper | Python/FastAPI | Fetches and normalizes articles from external sources based on JSON profiles. | Cron schedules, API calls, External Websites | Batched `Article` JSON objects | API Service |
3. Guiding Principles¶
The architecture and operational philosophy of the Scraper service are deeply rooted in these core SRE principles:
- Configuration as Code:
    - What: All scraping logic, targets, and schedules are defined in version-controlled JSON profiles (e.g., `scraper/profiles/*.json`); an illustrative profile sketch follows this list.
    - Why: This ensures that all changes are auditable, reviewable via standard Git workflows, and can be rolled back safely.
    - How: The filesystem is the single source of truth for profile definitions, eliminating the need for a separate database of sources.
    - Benefit: Promotes consistency across environments and simplifies disaster recovery.
- Extensibility over Hardcoding:
    - What: The service employs a Provider plugin pattern, allowing new scrapers for specific websites (e.g., `AljazeeraProvider`) to be added as self-contained Python classes.
    - Why: This design minimizes changes to the core application logic when integrating new sources, promoting modularity and independent development.
    - How: New providers inherit from `BaseProvider` and are registered in a central registry, making them discoverable by the profile loader.
    - Benefit: Accelerates onboarding of new data sources and reduces the risk of regressions in existing scrapers.
- Stateful & Efficient:
    - What: The service tracks the last-seen article for each source category by persisting a URL hash to disk (`data/out/state.json`).
    - Why: This prevents reprocessing of already-seen content, minimizing redundant work, reducing network traffic, and respecting external source rate limits.
    - How: Providers query the state before fetching and update it after a successful run, ensuring only new content is processed.
    - Benefit: Optimizes resource utilization and maintains good citizenship with external websites.
- API-Driven Control:
    - What: While designed for automated operation via its internal scheduler, all core functions (scraping, profile management) are exposed via a REST API.
    - Why: This provides a clear, programmatic interface for precise, on-demand control, essential for testing, debugging, and operational interventions.
    - How: Endpoints like `/scrape` and `/profiles/reload` allow external systems or operators to trigger specific actions.
    - Benefit: Enhances operational flexibility and enables integration with external orchestration tools.
- Idempotency:
    - What: The scraper is designed to be safely re-run without adverse effects, even if a job fails midway.
    - Why: This ensures that retries or manual re-executions do not create duplicate data or cause unexpected side effects.
    - How: This is achieved through the state persistence mechanism (tracking last-seen articles) and reliance on the downstream API's robust deduplication capabilities.
    - Benefit: Simplifies recovery procedures and increases confidence in automated retries.
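To make the Configuration as Code principle concrete, here is a minimal sketch of adding a new profile. The file name and every field in the JSON body are assumptions (not the real profile schema), and the POST method on the reload endpoint is assumed; the port, the `/profiles/reload` path, and the `GenericRssProvider` name come from elsewhere in this document.

```bash
# Sketch only: every field in this JSON body is an assumption, not the real
# profile schema. Check the Environment & Configuration page for the
# authoritative schema before authoring a real profile.
cat > scraper/profiles/example-news.json <<'EOF'
{
  "name": "example-news",
  "provider": "GenericRssProvider",
  "url": "https://example.com/rss.xml",
  "schedule": "*/30 * * * *",
  "enabled": true
}
EOF

# Profiles are version-controlled: review the change via the normal Git workflow.
git add scraper/profiles/example-news.json
git commit -m "Add example-news scraping profile"

# Apply without a restart by reloading profiles from disk (see section 6.3;
# the POST method is an assumption -- confirm against the Swagger UI).
curl -X POST http://localhost:9001/profiles/reload
```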
4. Architecture at a Glance¶
The Scraper is a self-contained FastAPI application with a clear, decoupled architecture designed for reliability and extensibility. It orchestrates fetching content from external websites and pushing it to the core API.
```mermaid
flowchart TD
    subgraph Triggers
        A[":material-clock-outline: Scheduler (APScheduler)"]
        B[":material-api: API Call (On-Demand)"]
    end
    subgraph "Core Service"
        C[Scraper Engine]
        D[Profile Loader]
        E[Provider Registry]
    end
    subgraph Configuration
        F[":material-file-code: profiles/*.json"]
    end
    subgraph "Providers (Strategies)"
        G[GenericRssProvider]
        H[SiteSpecificProvider]
        I[...]
    end
    subgraph "External Systems"
        L([External Websites]):::ext
        API[(Core API Service)]:::ext
    end
    A --> C
    B --> C
    C --> D
    D --> F
    C --> E
    E --> G & H & I
    G & H & I --> L
    L --> G & H & I
    G & H & I -- Normalized Articles --> C
    C -- Batch Ingest --> API
```
5. Standard Deployment Process¶
This checklist outlines the standard procedure for deploying the Scraper service to a new environment or updating an existing deployment. This process ensures consistency and minimizes downtime.
- Prepare Environment Variables: Ensure all required environment variables are set in your deployment environment (e.g., `docker-compose.yml`, Kubernetes secrets, or a `.env` file). Refer to the Environment & Configuration page for a complete list.
- Build Docker Image: Build the Docker image for the Scraper service using the `Dockerfile` located in `scraper/docker/`.
- Push Docker Image (if applicable): If deploying to a remote environment, push the newly built image to your container registry.
- Deploy to Target Environment: Update your deployment configuration (e.g., `docker-compose.yml`, Kubernetes deployment) to use the new image tag and apply the changes.
- Verify Deployment: After deployment, perform a quick health check to ensure the service is running and responsive.
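The commands below sketch the checklist above for a Docker Compose deployment. The image name, tag, registry, Compose service name (`scraper`), and health endpoint path are assumptions; adapt them to your environment.

```bash
# Sketch for a Docker Compose deployment. Image name, registry, service name,
# and health path are assumptions -- adapt to your environment.

# Build the image from the Dockerfile in scraper/docker/ (repo root as context)
docker build -f scraper/docker/Dockerfile -t registry.example.com/labeeb/scraper:v1.2.3 .

# Push to your container registry (remote deployments only)
docker push registry.example.com/labeeb/scraper:v1.2.3

# Deploy: point docker-compose.yml at the new tag, then apply
docker compose up -d scraper

# Verify: container is up and the API responds
docker compose ps scraper
curl -fsS http://localhost:9001/health   # health path is an assumption; see section 6.1
```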
6. Routine Operations & Maintenance¶
This section covers common operational tasks for the Scraper service.
6.1. Checking Service Health¶
Verify the service is alive and responsive.
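A minimal liveness check, assuming the service listens on port 9001 (as in the examples elsewhere in this guide) and exposes a `/health` endpoint; confirm the exact path on the API Endpoints (Swagger UI) page.

```bash
# Assumes port 9001 and a /health endpoint; verify the path in the Swagger UI.
curl -fsS http://localhost:9001/health && echo "Scraper is responsive"
```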
6.2. Monitoring Logs¶
Tail the service logs for real-time activity and error monitoring.
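Assuming a containerized deployment where the Compose service is named `scraper` (the name is an assumption), recent activity can be followed with:

```bash
# Follow live log output; the service name "scraper" is an assumption.
docker compose logs -f --tail=100 scraper

# Quick scan of the recent window for errors
docker compose logs --tail=1000 scraper | grep -iE "error|traceback"
```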
6.3. Managing Profiles¶
- List Loaded Profiles: See which profiles are currently active.
- Reload Profiles from Disk: Apply changes to JSON profile files without restarting the service.
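Both operations are available over the REST API. The `/profiles/reload` path appears earlier in this guide; the `GET /profiles` listing endpoint and the POST method for the reload are assumptions, so confirm them on the API Endpoints page.

```bash
# List currently loaded profiles (GET /profiles is an assumption).
curl -s http://localhost:9001/profiles

# Reload profiles from disk after editing scraper/profiles/*.json
# (POST method is an assumption).
curl -X POST http://localhost:9001/profiles/reload
```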
6.4. Triggering On-Demand Scrapes¶
Manually initiate a scrape for specific sources.
```bash
# Scrape 'aljazeera' and 'fatabyyano' profiles, limit to 5 articles each
curl -X POST http://localhost:9001/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "sources": ["aljazeera", "fatabyyano"],
    "limit": 5,
    "send_to_api": true
  }'
```
7. Structured Incident Playbooks¶
This section provides direct links to detailed runbooks for common operational incidents affecting the Scraper service. These playbooks are designed to be followed step-by-step under pressure.
- Source Rate-Limiting (HTTP 429): Playbook for diagnosing and mitigating `HTTP 429 Too Many Requests` errors from external data sources.
- Scraper Backlog & Stuck Scheduler: Playbook for diagnosing and recovering from a stalled scraper or a growing data backlog.
- Scraping Profile Failures: Playbook for diagnosing and resolving failures in specific scraping profiles, such as those caused by selector drift or content changes.
8. Key Documentation¶
This overview is the entry point. For detailed operational procedures and technical specifications, use the following guides:
- API Endpoints: An interactive Swagger UI for exploring and testing the Scraper's REST API.
- Environment & Configuration: A complete reference for all environment variables and profile schema details.
- Service Dependencies: A categorized list of all production and development dependencies.