AI-Box Observability Guide

This document is the SRE guide to the observability stack of the AI-Box service. It details the key signals the service emits and how to use them to assess its health and performance.

The Three Pillars of Observability

Our monitoring strategy is built on three pillars:

  • Metrics: Aggregated numerical data that provides a high-level view of system health (e.g., request rates, error rates, latency percentiles).
  • Logging: Detailed, structured records of discrete events, essential for deep debugging and root cause analysis.
  • Tracing: A view of the entire lifecycle of a request as it flows through multiple services. (Note: Distributed tracing is a future goal for the platform.)

1. Metrics (Prometheus)

The service exposes a wide range of metrics in a Prometheus-compatible format at the /metrics endpoint. These are the primary source for dashboards and alerting.

Locked Down

The metrics endpoint is disabled by default outside of local development. To enable it, set a METRICS_TOKEN environment variable and pass the token in the X-Metrics-Token request header.

export METRICS_TOKEN=dev
curl -H "X-Metrics-Token: $METRICS_TOKEN" http://localhost:8001/metrics

In production, use a strong token and restrict exposure to trusted networks only.
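
For example, a sufficiently random token can be generated with openssl:

openssl rand -hex 32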

METRICS_TOKEN lives in the service's .env file. Rotate it by updating the value in your secret store and Prometheus scrape config, then redeploy the service. Once the new token is confirmed, remove the old one.
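
If your Prometheus server is v2.54 or newer, the scrape config can send the token as a custom header via http_headers. The job name and target below are placeholders, not the platform's actual configuration:

scrape_configs:
  - job_name: "ai-box"
    http_headers:
      X-Metrics-Token:
        values: ["<production-token>"]   # or use secrets:/files: to avoid inlining the value
    static_configs:
      - targets: ["ai-box:8001"]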

Metrics Reference

| Metric | Type | Labels | Units | Description |
|---|---|---|---|---|
| aibox_requests_total | Counter | route, method, code | requests | HTTP requests by route/method/status |
| aibox_request_duration_seconds | Histogram | route | seconds | Request duration per route |
| aibox_retrieval_rrf_ms | Histogram | — | milliseconds | RRF fusion time |
| aibox_rerank_ms | Histogram | — | milliseconds | Rerank model time |
| aibox_rrf_mode | Gauge | mode | 1 | Active RRF mode |
| aibox_rrf_fallback_total | Counter | — | fallbacks | Times the Python RRF fallback was used |
| s1_requests_total | Counter | — | requests | S1 check-worthiness requests |
| s1_latency_seconds | Histogram | — | seconds | S1 latency |
| aibox_s1_mode | Gauge | mode | 1 | Active S1 mode |
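
For orientation, metrics with these shapes are typically declared with the prometheus_client library roughly as in the sketch below; this is an illustrative example of the naming and label scheme above, not the service's actual source code.

from prometheus_client import Counter, Histogram, Gauge

# Request counter with route/method/status labels, matching the table above.
REQUESTS_TOTAL = Counter(
    "aibox_requests_total",
    "HTTP requests by route/method/status",
    ["route", "method", "code"],
)

# Request duration per route, in seconds.
REQUEST_DURATION = Histogram(
    "aibox_request_duration_seconds",
    "Request duration per route",
    ["route"],
)

# Numeric encoding of the active RRF mode.
RRF_MODE = Gauge("aibox_rrf_mode", "Active RRF mode", ["mode"])

# Illustrative usage inside a request handler:
REQUESTS_TOTAL.labels(route="/retrieve", method="POST", code="200").inc()
REQUEST_DURATION.labels(route="/retrieve").observe(0.152)
RRF_MODE.labels(mode="native").set(1)  # example mode value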

Key Performance Indicators (KPIs)

| Metric | Prometheus Query | Threshold (Example) | Why It Matters |
|---|---|---|---|
| P95 Latency | histogram_quantile(0.95, sum by (le) (rate(aibox_request_duration_seconds_bucket[5m]))) | > 500ms | The latency within which 95% of requests complete; the primary measure of user experience. |
| Error Rate | sum(rate(aibox_requests_total{code=~"5.."}[5m])) / sum(rate(aibox_requests_total[5m])) | > 2% | A high error rate indicates a systemic problem with the service or its dependencies. |
| Request Rate | sum(rate(aibox_requests_total[5m])) | N/A | Provides a baseline of service traffic. Sudden drops can indicate upstream issues. |
| CPU Usage | rate(container_cpu_usage_seconds_total{container="ai-box"}[5m]) | > 85% of CPU limit | Sustained high CPU leads to increased latency and request queuing. |
| Memory Usage | container_memory_usage_bytes{container="ai-box"} | > 90% of memory limit | High memory usage risks the container being OOM-killed by the orchestrator. |
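
As an illustration, the error-rate threshold above translates into a Prometheus alerting rule along the following lines; the group name, alert name, severity label, and for: duration are placeholders rather than the platform's actual alert definitions.

groups:
  - name: ai-box-kpis
    rules:
      - alert: AiBoxHighErrorRate
        expr: |
          sum(rate(aibox_requests_total{code=~"5.."}[5m]))
            / sum(rate(aibox_requests_total[5m])) > 0.02
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "AI-Box 5xx error rate has been above 2% for 10 minutes"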

Per-Request Diagnostics

In addition to Prometheus metrics, the response body of the /retrieve and /retrieve_pack endpoints includes a diagnostics object with detailed timings for each stage of the retrieval process. This is invaluable for debugging specific slow queries.

Example /retrieve Response with Diagnostics
{
  ...
  "diagnostics": {
    "bm25_ms": 5.0,
    "knn_ms": 4.8,
    "rrf_ms": 0.01,
    "rerank_ms": 0.0
  }
}
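
To pull these timings for a specific slow query, you can call the endpoint directly and extract the diagnostics object; the request body below is a minimal guess at the payload (adjust the fields to the actual API contract):

curl -s -X POST http://localhost:8001/retrieve \
  -H "Content-Type: application/json" \
  -d '{"query": "elections in syria"}' | jq .diagnostics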

Dashboards & Alerts


2. Logging

The service uses the python-json-logger library to emit structured logs in JSON format. This is a critical feature for production-grade observability.

Why Structured Logs?

JSON logs are machine-readable, which allows them to be easily ingested, parsed, and queried in a centralized logging platform (such as OpenSearch, Loki, or Splunk). This enables powerful searching and filtering, e.g., "show all logs with level=ERROR for the /retrieve endpoint".

Example Log Entry

{
  "timestamp": "2025-08-25T10:30:00.123Z",
  "level": "INFO",
  "message": "Hybrid search completed",
  "route": "/retrieve",
  "query": "elections in syria",
  "results_count": 20,
  "duration_ms": 152.4
}
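
A logger configured roughly as follows produces entries in this shape; this is a minimal python-json-logger sketch (the field renaming and the extra fields are assumptions, not the service's exact logging setup).

import logging
from pythonjsonlogger import jsonlogger

# Emit every record as a single JSON object.
handler = logging.StreamHandler()
handler.setFormatter(jsonlogger.JsonFormatter(
    "%(asctime)s %(levelname)s %(message)s",
    rename_fields={"asctime": "timestamp", "levelname": "level"},
))

logger = logging.getLogger("aibox")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Structured fields are passed via `extra` and appear as top-level JSON keys.
logger.info(
    "Hybrid search completed",
    extra={"route": "/retrieve", "query": "elections in syria",
           "results_count": 20, "duration_ms": 152.4},
)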

3. Tracing (Future Goal)

Distributed tracing is not yet implemented in the Labeeb platform. However, it is the next logical step in our observability journey.

  • What it is: Tracing provides a way to visualize the entire lifecycle of a request as it moves from the client, through the API service, to the AI-Box, and finally to the OpenSearch cluster. Each step in the journey is a "span," and the collection of spans for a single request is a "trace."
  • Why it matters: It is the single most powerful tool for debugging latency issues in a microservices architecture, as it can pinpoint exactly which service or which operation is the bottleneck.