Endpoints¶
The Scraper service provides a comprehensive FastAPI interface for profile-driven web scraping and content normalization. All endpoints are designed for both scheduled and on-demand operation.
- /health --- Liveness + dependency checks. Returns summary JSON; 200 on success.
- /metrics --- Prometheus exposition format. Token-protected in non-debug mode.
- /profiles --- List all available profiles with their configuration status.
- /profiles/reload (POST) --- Reload profiles from disk (Git‑ops friendly). Validates and applies changes without restart.
- /profiles/{name}/categories --- List categories available for a specific profile.
- /runs (POST) --- Trigger a run with a JSON body such as { "profile_id": "…", "limit": 100 }.
- /runs (GET) --- Query historical runs with filters.
- /replay (POST) --- Guarded endpoint to re‑emit cached items.
- /scrape (POST) --- On‑demand scrape for profiles, with comprehensive filtering options.
GET /health¶
Verify that the service is running and check dependency status.
Returns 200 OK with a summary JSON when the service and its dependencies are healthy; a failed dependency check returns a non-200 status with details.
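A minimal illustrative exchange is sketched below. The response fields, and the 503 status used for the failure case, are assumptions for illustration rather than the service's documented schema.

GET /health

200 OK
{
  "status": "ok",
  "dependencies": {
    "ingest_api": "ok"
  }
}

Failure example (a dependency check fails):

503 Service Unavailable
{
  "status": "degraded",
  "dependencies": {
    "ingest_api": "error"
  }
}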
GET /metrics¶
Prometheus endpoint for basic counters and gauges.
- Endpoint: GET /metrics
- Headers (non-debug only): X-Metrics-Token: <token>
Key metrics:
- Request counters and latency histograms
- Profile execution statistics
- Provider success/failure rates
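For illustration, a request with the token header and a fragment of the exposition output might look like the following; the metric name below is a placeholder, not the service's actual metric name.

GET /metrics
X-Metrics-Token: <token>

200 OK (text/plain)
# HELP scraper_requests_total Total HTTP requests handled
# TYPE scraper_requests_total counter
scraper_requests_total{endpoint="/scrape",status="200"} 42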
GET /profiles¶
List all available profiles with their configuration and status.
200 OK
[
{
"name": "aljazeera",
"provider": "aljazeera",
"enabled": true,
"language": "ar",
"schedule": "*/30 * * * *"
}
]
GET /profiles/{name}/categories¶
List categories available for a specific profile.
Returns 200 OK with the profile's category list.
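A sketch of the exchange, assuming the verify_sy profile and a plain array response (the exact response shape is not documented here and is an assumption):

GET /profiles/verify_sy/categories

200 OK
["fake", "real"]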
POST /profiles/reload¶
Reload profiles from disk. Use this endpoint after adding, removing, or editing any of the JSON files in the /profiles directory so the changes take effect without restarting the service.
Validation: Profiles are validated against the JSON Schema in app/data/schemas/profile.schema.json. Invalid files are logged and skipped.
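A sketch of the call; the response fields (reloaded, skipped) are illustrative assumptions, not the documented schema:

POST /profiles/reload

200 OK
{
  "reloaded": 12,
  "skipped": 1
}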
POST /runs¶
Trigger a scraping run for a profile. The run is accepted asynchronously and the endpoint responds with 202 Accepted.
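An illustrative request using the body shown in the endpoint summary, with a profile name taken from the examples above; the response fields (run_id, status) are assumptions:

POST /runs
Content-Type: application/json

{
  "profile_id": "aljazeera",
  "limit": 100
}

202 Accepted
{
  "run_id": "…",
  "status": "queued"
}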
POST /scrape (on‑demand)¶
Trigger an immediate scrape for one or more sources with comprehensive filtering options.
Request Body Parameters:
- sources (array, optional): A list of profile names (e.g., ["aljazeera", "fatabyyano"]). Defaults to all enabled profiles.
- query (string, optional): A keyword to filter article titles and content.
- categories (array, optional): Filter by specific categories (e.g., ["fake", "real"]).
- limit (integer, optional): A hard limit on the number of articles to return per source.
- write_to_disk (boolean, optional): If true, appends the output to a .jsonl file in /app/data/out.
- send_to_api (boolean, optional): If true (default), sends scraped articles to the ingest API.
POST /scrape
Content-Type: application/json
{
"sources": ["verify_sy"],
"query": "سوريا",
"categories": ["fake"],
"limit": 10,
"write_to_disk": true,
"send_to_api": false
}
200 OK --- returns the list of normalized items, or the output file path if write_to_disk=true.
Response:
{
"results": [
{
"id": "unique_article_id",
"title": "Article Title",
"content": "Article content...",
"url": "https://source.com/article",
"published_at": "2025-01-01T00:00:00Z",
"source": "Source Name",
"category": "politics",
"language": "ar"
}
],
"metadata": {
"total_articles": 1,
"sources_processed": ["verify_sy"],
"processing_time_ms": 1250
}
}
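When write_to_disk=true, the response may instead reference the output file under /app/data/out; the field name and placeholder path below are illustrative assumptions:

{
  "output_file": "/app/data/out/<run>.jsonl",
  "total_articles": 10
}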
Auth¶
Protect the write endpoints (POST /runs, /replay, /scrape, and /profiles/reload) with service auth (token or IP allow-list).
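How the token is presented is deployment-specific; a common pattern, shown here as an assumption rather than the documented header, is a bearer token on each write request:

POST /scrape
Authorization: Bearer <token>
Content-Type: application/json

{
  "sources": ["aljazeera"],
  "limit": 5
}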
Interactive API Console¶
FastAPI serves interactive API documentation at /docs (Swagger UI) and /redoc by default, where the endpoints above can be exercised directly (unless these routes are disabled in the app configuration).
Last updated: 2025-08-25 · Docs version: v0.3