Labeeb Runbook (On-Call)

A single page that gives on-call engineers clear, accurate, and safe steps to restore ${SERVICE}.

How to use this page

  • Read top-to-bottom on first response.
  • Use the checklists; copy/paste commands as-is.
  • Escalation paths are at the bottom.
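
The commands on this page reference placeholders such as ${API_URL}. One way to make them safe to paste as-is is to fail fast when a placeholder is unset. The sketch below assumes the placeholders are supplied as shell environment variables; require_vars is a hypothetical helper and is not part of the repository.

```shell
# Sketch: report every placeholder that is unset or empty.
# Assumption: placeholders are exported in the current shell (for example
# from a sourced .env file); the runbook itself does not say how they are set.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup that works in POSIX sh as well as bash.
    eval "value=\${$name-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: require_vars API_URL SCRAPER_URL AI_BOX_URL OS_URL || exit 1
```

The helper lists all missing names before returning, so one run shows everything that still needs a value.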

At-a-Glance

  • Environments: [PROD_BADGE] [STAGE_BADGE] [DEV_BADGE]
  • Services: api · ai-box · scraper · search
  • Quick Links: Dashboards · Alerts · Runbooks

Signal      Owner         Probe
----------  ------------  -----------------------------------------------------------
Latency     ${OWNER_SRE}  curl -w "%{time_total}\n" -o /dev/null -s ${API_URL}/health
Errors      ${OWNER_SRE}  docker compose logs --tail=50
Saturation  ${OWNER_SRE}  docker stats --no-stream
Traffic     ${OWNER_SRE}  curl -s ${API_URL}/metrics | jq '.http_requests'

On-call Overview

  • First checks → health endpoints + traces
  • Then → go to the specific runbook (API/AI-Box/Scraper/Search)
  • Finally → consult the incident playbooks

curl -s ${API_URL}/health | jq
curl -s ${SCRAPER_URL}/health | jq
curl -s ${AI_BOX_URL}/health | jq
curl -s ${OS_URL}/_cluster/health | jq '.status'
curl -s -D- "${API_URL}/v1/search?q=test" -H "X-Request-ID: oncall-$RANDOM" | sed -n '1,40p'
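
The first checks above can be wrapped into one pass that prints a verdict per service. This is a minimal sketch, assuming an HTTP 200 from each health endpoint means the service answered; probe, check_health, and check_all are hypothetical helper names. Note that OpenSearch can return 200 from _cluster/health even when the cluster status is not green, so the JSON body still needs a look.

```shell
# Sketch: probe each service once and print one verdict line per service.
# probe() wraps curl so it can be replaced, for example in a dry run.
probe() {
  # Print only the HTTP status code; curl prints 000 when the connection fails.
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

check_health() {
  name=$1
  url=$2
  code=$(probe "$url") || code=000
  if [ "$code" = "200" ]; then
    echo "OK $name"
  else
    echo "FAIL $name (HTTP $code)"
    return 1
  fi
}

check_all() {
  failed=0
  check_health api     "${API_URL}/health"         || failed=1
  check_health scraper "${SCRAPER_URL}/health"     || failed=1
  check_health ai-box  "${AI_BOX_URL}/health"      || failed=1
  check_health search  "${OS_URL}/_cluster/health" || failed=1
  return "$failed"
}

# Usage: check_all || echo "at least one service did not answer with 200"
```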

See also: Operational Contracts

First 5 Minutes — Universal Checklist

  1. Confirm incident scope (user-facing? ingestion-only? search-only?).
    docker compose ps
    
  2. Check health endpoints across services.
    curl -s ${API_URL}/health | jq
    curl -s ${SCRAPER_URL}/health | jq
    curl -s ${AI_BOX_URL}/health | jq
    
  3. Tail logs with filters.
    docker compose logs -f --tail=200 ${SERVICE} | grep -i "error"
    
  4. Capture context (links are placeholders).
    • ${TICKET_URL}
    • ${DASHBOARD_URL}
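
For step 3, a count of matching log lines is often quicker to read than a live tail. The following is a sketch only: count_matches is a hypothetical helper, and the default pattern is an assumption that should be adapted to the actual log format.

```shell
# Sketch: count log lines on stdin that match a case-insensitive pattern.
count_matches() {
  pattern="${1:-error|exception|timeout}"
  # grep -c prints the count; it exits 1 on zero matches, so mask that status.
  grep -Eic -- "$pattern" || true
}

# Usage:
#   docker compose logs --tail=200 "${SERVICE}" | count_matches
#   docker compose logs --tail=200 scraper | count_matches '429'
```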

# Start all services
docker compose up -d

# Stop all services
docker compose down

# Tail logs for a service
docker compose logs -f <service>

# Health check
curl -s ${API_URL}/v1/health

# Authenticated request
curl -s -X POST -H "Authorization: Bearer ${INGEST_TOKEN}" -H "Content-Type: application/json" -d @payload.json ${API_URL}/v1/ingest/articles

Service Status & Health Commands

api

  • Health: GET ${API_URL}/health → JSON {"status":"ok"}
    curl -s ${API_URL}/health | jq
    docker compose logs api -n 20
    docker compose restart api
    
  • Depends on: PostgreSQL, Redis, OpenSearch

ai-box

  • Health: GET ${AI_BOX_URL}/health → JSON {"status":"ok"}
    curl -s ${AI_BOX_URL}/health | jq
    docker compose logs ai-box -n 20
    docker compose restart ai-box
    
  • Depends on: OpenSearch, model cache

scraper

  • Health: GET ${SCRAPER_URL}/health → JSON {"status":"ok"}
    curl -s ${SCRAPER_URL}/health | jq
    docker compose logs scraper -n 20
    docker compose restart scraper
    
  • Depends on: api

search

  • Health: GET ${OS_URL}/_cluster/health → JSON field status
    curl -s ${OS_URL}/_cluster/health | jq '.status'
    docker compose logs search -n 20
    docker compose restart search
    
  • Depends on: disk, JVM heap
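
The status field returned by _cluster/health takes one of three values: green, yellow, or red. A small sketch of how that value could be turned into a verdict follows; cluster_verdict and its exit codes are illustrative assumptions, not an existing tool.

```shell
# Sketch: map an OpenSearch cluster status to a short verdict.
# Exit codes (0 green, 1 yellow, 2 red, 3 unknown) are a convention chosen
# for this example only.
cluster_verdict() {
  case "$1" in
    green)  echo "healthy: all shards allocated" ;;
    yellow) echo "degraded: primaries allocated, some replicas unassigned"; return 1 ;;
    red)    echo "critical: at least one primary shard unassigned"; return 2 ;;
    *)      echo "unknown status: $1"; return 3 ;;
  esac
}

# Usage (jq -r strips the JSON quotes around the value):
#   cluster_verdict "$(curl -s "${OS_URL}/_cluster/health" | jq -r '.status')"
```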

Common Runbooks (Pointers)

Safety Rails

Do not

  • Purge indices in production without a recent snapshot.
  • Reindex with stale mappings.
  • Roll ai-box without a warm model cache.

Pre-flight checks

  • Verify that a recent snapshot exists and completed successfully.
  • Verify .env and compose overrides for the target environment.
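
The safety rails above could be enforced with a small guard that runs before any destructive step. This is a sketch under stated assumptions: preflight_ok is a hypothetical helper, the caller supplies the snapshot age in hours, and "recent" is taken to mean 24 hours because the runbook does not define a threshold.

```shell
# Sketch: refuse destructive work in prod when the last snapshot is too old.
# Arguments: target environment, snapshot age in hours, optional limit in hours.
preflight_ok() {
  target_env=$1
  snapshot_age_hours=$2
  max_age_hours=${3:-24}   # assumed default; not specified by the runbook
  if [ "$target_env" != "prod" ]; then
    echo "ok: non-production target"
    return 0
  fi
  if [ "$snapshot_age_hours" -le "$max_age_hours" ]; then
    echo "ok: snapshot is ${snapshot_age_hours}h old"
    return 0
  fi
  echo "refuse: last snapshot is ${snapshot_age_hours}h old (limit ${max_age_hours}h)"
  return 1
}

# Usage: preflight_ok prod 3 && echo "safe to continue"
```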

Quick Probes (Copy/Paste)

# api
curl -s -o /dev/null -w "%{time_total}\n" ${API_URL}/health
curl -s "${API_URL}/v1/search?q=test" | jq '.hits | length'
curl -s ${API_URL}/metrics | jq '.http_requests'

# ai-box
curl -s ${AI_BOX_URL}/health | jq
curl -s -H 'Content-Type: application/json' ${AI_BOX_URL}/s1/check-worthiness -d '{"text":"test"}' | jq '.score'
curl -s ${AI_BOX_URL}/metrics | jq '.inflight'

# scraper
curl -s ${SCRAPER_URL}/health | jq
curl -s ${SCRAPER_URL}/profiles | jq '. | length'
docker compose logs scraper -n 20

# search
curl -s ${OS_URL}/_cluster/health | jq '.status'
curl -s ${OS_URL}/${INDEX}/_count | jq '.count'
curl -s -H 'Content-Type: application/json' ${OS_URL}/${INDEX}/_search -d '{"query":{"match_all":{}}}' | jq '.hits.total'
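
The latency probe prints a raw number of seconds, which is easier to act on once it is compared against a threshold. A minimal sketch follows; latency_verdict is a hypothetical helper and the 0.5 second default is an assumption, not a documented SLO.

```shell
# Sketch: classify a curl time_total value against a threshold in seconds.
latency_verdict() {
  seconds=$1
  threshold=${2:-0.5}   # assumed default threshold
  # awk handles the floating-point comparison that plain sh cannot.
  awk -v t="$seconds" -v max="$threshold" \
    'BEGIN { if (t + 0 > max + 0) print "SLOW"; else print "OK" }'
}

# Usage:
#   latency_verdict "$(curl -s -o /dev/null -w '%{time_total}' "${API_URL}/health")"
```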

Triage Matrix

Symptom           Likely Root Cause    Probe                                                   Next Action
----------------  -------------------  ------------------------------------------------------  ------------------------------------
Search slow       OpenSearch overload  curl -s "${OS_URL}/_nodes/stats/jvm?pretty"             Scale nodes or clear heavy queries
429 from sources  Upstream rate limit  docker compose logs scraper | grep 429                  Back off and adjust profile schedule
Reranker timeout  AI-Box saturation    curl -s ${AI_BOX_URL}/metrics | jq '.latency_reranker'  Restart ai-box; check model cache
Auth timeouts     API or DB latency    curl -s ${API_URL}/health | jq                          Restart API or database

Minimal Architecture (Orientation)

flowchart TD
    FE[FE] --> API
    API --> Queue
    Queue --> AIB[AI-Box]
    AIB --> OS[OpenSearch]
    SCR[Scraper] --> API

Escalation

SEV    Description          Response SLO
-----  -------------------  -----------------
SEV-1  Full outage          ≤5 min
SEV-2  Major user impact    ≤15 min
SEV-3  Partial degradation  ≤1 hr
SEV-4  Minor issue          Next business day

  • Contacts: ${PRIMARY_ONCALL}, ${BACKUP_ONCALL}, ${ENG_MANAGER}
  • Handover checklist:
    • Update incident channel.
    • Link dashboards and logs.
    • Transfer open actions.
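
The response SLOs in the severity table can be checked mechanically during an incident. The sketch below encodes the table's minute budgets (5, 15, and 60) and reports whether a given elapsed time is still inside the budget; slo_minutes and slo_status are hypothetical helpers, and SEV-4 is treated as having no minute-based budget.

```shell
# Sketch: compare elapsed minutes against the response SLO for a severity.
slo_minutes() {
  case "$1" in
    SEV-1) echo 5 ;;
    SEV-2) echo 15 ;;
    SEV-3) echo 60 ;;
    *) return 1 ;;   # SEV-4 (next business day) has no minute budget
  esac
}

slo_status() {
  sev=$1
  elapsed_minutes=$2
  if ! budget=$(slo_minutes "$sev"); then
    echo "no minute-based SLO for $sev"
    return 0
  fi
  if [ "$elapsed_minutes" -gt "$budget" ]; then
    echo "BREACHED: $sev response SLO is ${budget} min, elapsed ${elapsed_minutes} min"
    return 1
  fi
  echo "within SLO: $((budget - elapsed_minutes)) min left for $sev"
}

# Usage: slo_status SEV-2 20 || echo "escalate to the backup on-call"
```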

Appendices

Placeholder     Description
--------------  -------------------------------
${API_URL}      Base URL for API service
${SCRAPER_URL}  Base URL for Scraper service
${AI_BOX_URL}   Base URL for AI-Box service
${OS_URL}       Base URL for OpenSearch cluster

Variable     Description
-----------  -------------------
EXAMPLE_VAR  Example description

Note

Follow Conventional Commits.

Warning

Commands assume execution from the repository root unless noted.

Last updated: ${DATE} · Version: ${DOCS_VERSION}