---
title: "Runbook: Scraping Profile Failures"
description: A playbook for diagnosing and resolving failures in specific scraping profiles, such as from selector drift or content changes.
icon: material/file-alert-outline
---
# Runbook: Scraping Profile Failures

**Impact: Data Quality Degradation**

This alert fires when a specific scraping profile consistently fails to fetch or process articles while other profiles succeed. The result is a loss of data from the affected source, which degrades the completeness and timeliness of the Labeeb platform.
## Triage Checklist (5 Minutes)
Your immediate goal is to isolate the failing profile and understand the nature of the failure.
1. **Identify the Failing Profile:** Check the service logs for repeated error messages associated with a specific profile name.

2. **Isolate the Failure:** Trigger a manual, on-demand scrape for only the suspected profile. This provides a clean set of logs and a direct response to analyze.

    ```sh
    # Replace 'problem-source' with the name of the failing profile.
    # "limit": 5 keeps the test fast; "send_to_api": false and
    # "write_to_disk": false prevent partial/bad data from going upstream.
    # (JSON does not allow comments, so the notes live up here.)
    curl -X POST http://localhost:9001/scrape \
      -H "Content-Type: application/json" \
      -d '{
        "sources": ["problem-source"],
        "limit": 5,
        "send_to_api": false,
        "write_to_disk": false
      }'
    ```

3. **Analyze the Response & Logs:**
    - Check the `failed` array in the JSON response from the previous command.
    - Immediately after the manual scrape, tail the logs again for the detailed error message or Python traceback. This will tell you *why* it failed.
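Step 1 can be scripted as a quick log scan. A minimal sketch, assuming each log line tags its source as `profile=<name>` — an assumption about the log format, so adjust the pattern to match the real one:

```sh
# failing_profiles: rank profiles by error count from log lines on stdin.
# The 'profile=<name>' field is an assumed log format; adjust the grep
# pattern to whatever the service actually emits.
failing_profiles() {
  grep -i 'error' \
    | grep -o 'profile=[A-Za-z0-9_-]*' \
    | sort | uniq -c | sort -rn
}

# Usage (container name is illustrative):
#   docker logs --since 15m labeeb-scraper 2>&1 | failing_profiles
```

The profile at the top of the ranking is your candidate for the isolated manual scrape in step 2.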
## Remediation Playbooks

Based on your triage, select the appropriate playbook to resolve the issue.
### Playbook 1: Selector Drift

**Symptom:** The logs show parsing errors (e.g., `AttributeError: 'NoneType' object has no attribute 'select_one'`), or the scrape returns zero articles from a source that should have many. This is the most common cause of failure and happens when a website redesigns its HTML layout.

1. **Get the Target URL:** Open the profile's JSON file in `scraper/profiles/` and find a URL from the `start_urls` list.

2. **Inspect the Live HTML:** Use `curl` from within the container to fetch the live HTML of the page, or open it in a browser and use the developer tools.

3. **Compare HTML to Selectors:**
    - If using the `generic_html` provider, check the CSS selectors in the profile's `meta.selectors` object.
    - If using a site-specific provider (e.g., `aljazeera.py`), check the hardcoded selectors in the provider's Python file.

4. **Update the Selectors:** Modify the selectors in the appropriate file to match the new HTML structure.

5. **Test the Fix:** Re-run the isolated manual scrape command from the triage step. A successful response with a `count` greater than zero indicates the fix was successful.

6. **Reload All Profiles:** Once verified, apply the change permanently by reloading all profiles.
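For step 3 with the `generic_html` provider, it helps to dump the profile's selectors so you can compare them against the live HTML side by side. A minimal sketch — the `meta.selectors` shape and the selector keys shown are illustrative assumptions about the profile schema, not the real one:

```python
import json

def list_selectors(profile_text: str) -> dict:
    """Return the field -> CSS selector mapping from a profile's JSON text."""
    profile = json.loads(profile_text)
    return profile.get("meta", {}).get("selectors", {})

# Illustrative profile; in practice, read scraper/profiles/<name>.json.
example = """{
  "name": "problem-source",
  "start_urls": ["https://example.com/news"],
  "meta": {"selectors": {"title": "h1.headline", "body": "div.article-body"}}
}"""

for field, selector in list_selectors(example).items():
    print(f"{field}: {selector}")
```

With the selectors printed, a `grep` for each class name in the fetched HTML quickly shows which ones no longer match.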
### Playbook 2: Parsing Library Errors

**Symptom:** The logs show a traceback from a parsing library such as `dateparser` or `newspaper3k`. This can happen when a website changes its date format or article structure in a subtle way.

1. **Identify the Erroring Library:** The Python traceback in the logs will clearly name the call that is failing (e.g., `dateparser.parse()`).

2. **Isolate the Problem Content:** The logs should also contain the specific URL of the article that failed to parse. Manually inspect the content at that URL.

3. **Implement a Code Fix:** This type of error almost always requires a code change in the provider's Python file in `scraper/app/scraping/providers/`. You may need to add more robust error handling or adjust the logic to account for the new content format.

4. **Deploy the Fix:** This change requires a new Docker image to be built and deployed.
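The "more robust error handling" in step 3 usually means tolerating a format the library no longer recognizes. A standard-library sketch of the fallback pattern — in the real provider this would wrap `dateparser.parse()`; the function name and format list here are hypothetical:

```python
from datetime import datetime
from typing import Optional

# Formats observed from this source so far; extend when the site changes.
KNOWN_FORMATS = ["%Y-%m-%dT%H:%M:%S", "%d %B %Y", "%B %d, %Y"]

def parse_article_date(raw: str) -> Optional[datetime]:
    """Try each known format in turn; return None instead of raising on drift."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    return None  # Caller should log the raw value and skip the article.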
### Playbook 3: Source Website Down or Blocking Requests

**Symptom:** The logs show connection timeouts, `403 Forbidden`, or `404 Not Found` errors for all requests to a specific domain.

1. **Verify External Status:** Confirm the website is down for everyone, not just our scraper. Use an external tool such as a web browser.

2. **Disable the Profile:** The safest and most immediate action is to temporarily disable the profile to stop generating failing requests. Follow the procedure in the Source Rate-Limiting Runbook.

3. **Notify Stakeholders:** Inform the relevant team that a data source is offline and that its data will be missing until it is restored.
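For step 1, a small sketch that reports the HTTP status a URL returns from the current host; running it from your workstation and again from inside the container and comparing results helps distinguish an outage (both fail) from blocking (only the scraper fails). The URL is a placeholder:

```sh
# check_source: print the HTTP status code a URL returns from this host.
# 000 means no HTTP response at all (DNS failure, timeout, connection refused).
check_source() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 15 "$1" || true
}

# Example:
#   check_source 'https://example.com/'
```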
## Post-Incident Actions

- **Improve Selector Robustness:** For critical sources, consider adding fallback selectors or using more resilient selection logic (e.g., searching for microdata schemas instead of CSS classes).
- **Create Regression Tests:** For complex, site-specific providers, add a simple test case to the `scraper/tests/` directory that fetches a saved copy of a known-good HTML page and asserts that the selectors can still parse it. This catches selector drift in CI before it reaches production.
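Such a regression test can be sketched as below. To stay self-contained the sketch uses the standard-library `HTMLParser` rather than the provider's real selector logic, and the `article-title` class, fixture path, and test name are all hypothetical — a real test would import the provider and feed it a saved fixture page:

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects the text of <h2 class="article-title"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.titles: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "article-title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())

def extract_titles(html: str) -> list[str]:
    parser = TitleExtractor()
    parser.feed(html)
    return parser.titles

def test_selectors_still_parse_known_good_page():
    # A real test would load a saved fixture instead, e.g.:
    #   html = Path("scraper/tests/fixtures/problem-source.html").read_text()
    html = '<div><h2 class="article-title">Example headline</h2></div>'
    assert extract_titles(html) == ["Example headline"]
```

Because the fixture is a frozen copy of the page, the test fails only when the selectors change behavior, not when the live site does.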