---
title: "Runbook: Scraper Backlog & Stuck Scheduler"
description: A playbook for diagnosing and recovering from a stalled scraper or a growing data backlog.
icon: material/backup-restore
---
# Runbook: Scraper Backlog & Stuck Scheduler
!!! danger "Impact: High - Stale Platform Data"

    This alert fires when the Scraper service stops processing new articles, either due to a scheduler failure or a persistent inability to fetch or ingest data. The direct impact is that the entire Labeeb platform will be operating on stale data, and new events will not be reflected in search results or analysis.
## Triage Checklist (5 Minutes)
Your immediate goal is to determine why the scraper is not processing data. Follow these steps methodically; a combined command sketch follows the checklist.

1. **Verify Service Health**: First, confirm the service is running and responsive.
2. **Check for Scheduler Activity in Logs**: Inspect the logs for messages from the `APScheduler` component. A lack of recent "scheduled" or "running job" messages indicates a stalled scheduler.
3. **Check for Errors**: Look for obvious errors like network timeouts, crashes, or repeated exceptions.
4. **Check Upstream API Health**: The scraper depends on the main API to ingest data. If the API is down, the scraper will be blocked. Check the API's health.
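If you prefer to run the checklist from a terminal, the sketch below strings the steps together. It assumes a Compose service named `scraper` and an API health endpoint at `http://localhost:8000/health`; neither is specified in this runbook, so substitute your actual service name and URL.

```bash
# 1. Is the container up and healthy? (service name assumed)
docker compose ps scraper

# 2. Any recent scheduler activity? Look for APScheduler "running job" lines.
docker compose logs --tail=500 scraper | grep -iE "apscheduler|running job|scheduled"

# 3. Any obvious errors in the same window?
docker compose logs --tail=500 scraper | grep -iE "error|timeout|exception|traceback"

# 4. Is the upstream API reachable? (URL assumed)
curl -fsS http://localhost:8000/health
```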
## Remediation Playbooks
Based on your triage, select the appropriate playbook to resolve the issue.
**Symptom:** The logs show the scheduler is running, but no new jobs are being executed for a specific profile. This often points to a stale lock from a previous run that failed without cleanup.
!!! warning "Manual State Intervention Required"

    This procedure involves manually editing the service's state file. This is a high-risk operation. Proceed with caution and make a backup of the file before making any changes.
1. **Enter the Container**: Open a shell inside the running scraper container.
2. **Backup the State File**: The state file (`state.json`) is the source of truth for the last-seen articles. Create a timestamped backup.
3. **Inspect the State File**: Examine the contents of the state file to identify the stale entry. Look for the profile that is no longer running.
4. **Clear the Stale Lock**: Using a text editor inside the container (like `vi` or `nano`), carefully remove the entry for the stalled profile or category. Alternatively, for a full reset of a single profile, you can use `jq`.
5. **Trigger a Manual Scrape**: Exit the container and trigger a manual scrape for the affected profile to verify that it now runs correctly. A command sketch covering these steps follows this list.
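A compressed version of these steps might look like the sketch below. The container name (`scraper`), the state file path (`/app/data/state.json`), the `profiles` key, the `profile` request field, and port `8001` are illustrative assumptions not specified in this runbook, and `jq` must actually be available inside the container.

```bash
# 1. Open a shell inside the scraper container (service name assumed).
docker compose exec scraper /bin/sh

# Steps 2-4 run inside the shell opened above.

# 2. Back up the state file with a timestamp (path is an assumption).
cp /app/data/state.json /app/data/state.json.bak.$(date +%Y%m%d%H%M%S)

# 3. Inspect the file and locate the stalled profile's entry.
jq . /app/data/state.json

# 4. Remove the stale entry for a single profile (key layout is hypothetical).
jq 'del(.profiles["example_profile"])' /app/data/state.json > /tmp/state.json \
  && mv /tmp/state.json /app/data/state.json

# 5. Back on the host, trigger a manual scrape for that profile (port and payload assumed).
curl -X POST http://localhost:8001/scrape \
  -H "Content-Type: application/json" \
  -d '{"profile": "example_profile"}'
```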
**Symptom:** The scraper logs are filled with connection errors or HTTP 5xx status codes when trying to contact the main API service.
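To confirm this is what you are seeing, a quick filter over the recent logs is usually enough; the service name `scraper` and the log wording matched here are assumptions.

```bash
# Look for upstream API failures in the recent scraper logs.
docker compose logs --tail=500 scraper | grep -iE "connection (error|refused|reset)|5[0-9]{2}"
```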
1. **Confirm API Unavailability**: Follow the triage steps in the API Service Runbook to diagnose and resolve the issue with the main API.
2. **Temporarily Disable Ingestion**: If the API service requires extended downtime, you can prevent the scraper from generating further errors by configuring it to write to disk instead of sending to the API. This is done by submitting a `POST` request to the `/scrape` endpoint with `send_to_api` set to `false`, as sketched after the note below.

    !!! note

        This is a temporary measure. Once the API is restored, the data written to disk will need to be manually ingested.
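A minimal sketch of that request is shown below. The `/scrape` endpoint and the `send_to_api` flag come from this runbook; the host, port, and the `profile` field are assumptions, and the exact payload shape may differ in your deployment.

```bash
# Ask the scraper to run but write results to disk instead of posting to the API.
curl -X POST http://localhost:8001/scrape \
  -H "Content-Type: application/json" \
  -d '{"profile": "example_profile", "send_to_api": false}'
```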
**Symptom:** The scraper is running but is extremely slow, or the container is frequently restarting.
1. **Check Container Resource Usage**: Use `docker stats` to check the CPU and memory usage of the scraper container.
2. **Increase Resources**: If the container is hitting its resource limits, increase the memory or CPU allocation in your `docker-compose.yml` file. A command sketch follows this list.
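The sketch below shows the kind of commands involved; the container name `scraper` is an assumption. The limits themselves live in `docker-compose.yml` (for example under the service's `deploy.resources` or `mem_limit`/`cpus` keys), so the last step is simply re-creating the container after you edit that file.

```bash
# Snapshot current CPU / memory usage for the scraper container (name assumed).
docker stats --no-stream scraper

# Check whether the container has been OOM-killed or is restart-looping.
docker inspect --format 'OOMKilled={{.State.OOMKilled}} Restarts={{.RestartCount}}' scraper

# After raising the limits in docker-compose.yml, re-create the service.
docker compose up -d --force-recreate scraper
```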
## Post-Incident Actions
- **Root Cause Analysis**: Determine why the job stalled or the lock was not released. Was it a network blip, a bug in a provider, or a non-graceful shutdown?
- **Improve Lock Management**: Investigate adding a TTL (Time To Live) to the job locks in `state.py` so that they expire automatically after a reasonable period.
- **Enhance Health Checks**: The `/health` endpoint could be improved to include the status of the scheduler and the age of the last successfully completed job; a sketch of such a check follows this list.
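As a concrete example of the last point, an enriched `/health` response could be spot-checked from the command line. The `scheduler_running` and `last_job_completed_at` fields below are hypothetical; they illustrate the proposed enhancement, not the current response, and the host and port are assumptions.

```bash
# Hypothetical check against an enriched /health response (fields do not exist yet).
# Fails if the scheduler is stopped or the last completed job is older than 30 minutes.
curl -fsS http://localhost:8001/health | jq -e '
  .scheduler_running == true
  and ((now - (.last_job_completed_at | fromdateiso8601)) < 1800)'
```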