AI-Box Operations Runbook

This document provides step-by-step, checklist-style procedures for all common operational tasks related to the AI-Box service. It is designed for clarity and action under pressure.


Procedure: Diagnosing Search Relevance Issues

Objective

To triage a user report of "bad" or irrelevant search results from the /retrieve endpoint. This procedure helps isolate whether the issue is with keyword search, vector search, or the ranking logic.

Checklist

  1. Replicate the Exact Query:

    • Use curl to send the exact request payload that is producing irrelevant results.
      curl -X POST http://localhost:8001/retrieve -H "Content-Type: application/json" -d '{
        "query": "<the exact user query>",
        "lang": "en",
        "k_bm25": 20,
        "k_knn": 20
      }'
      
  2. Isolate the BM25 (Keyword) Leg:

    • Re-run the query with vector search disabled (k_knn: 0). This shows you the raw keyword search results.
      curl -X POST http://localhost:8001/retrieve -H "Content-Type: application/json" -d '{
        "query": "<the exact user query>",
        "k_bm25": 20,
        "k_knn": 0
      }'
      
    • Analyze: Are these results relevant? If not, the issue may be with the text analysis configuration in OpenSearch.
  3. Isolate the k-NN (Vector) Leg:

    • Re-run the query with keyword search disabled (k_bm25: 0). This shows you the raw semantic search results.
      curl -X POST http://localhost:8001/retrieve -H "Content-Type: application/json" -d '{
        "query": "<the exact user query>",
        "k_bm25": 0,
        "k_knn": 20
      }'
      
    • Analyze: Are these results semantically related to the query? If not, the issue may be with the embedding model or the vector index.
  4. Check OpenSearch Directly:

    • If one of the legs is returning poor results, construct a raw OpenSearch query that bypasses the AI-Box entirely. This confirms whether the issue lies in the AI-Box's query construction or in the search cluster itself.
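The payloads for steps 1 to 3 differ only in their k values, so they can be generated from a single query string. The sketch below also builds raw _search bodies for step 4. It is a minimal illustration, not the AI-Box's own query construction: the helper names are hypothetical, the text field name "text" is an assumption, and "embedding" is taken from the kNN example elsewhere in this runbook. Check both field names against your index mapping.

```python
import json


def retrieve_payloads(query, k=20, lang=None):
    """Build the hybrid, BM25-only, and k-NN-only payloads for /retrieve."""
    base = {"query": query}
    if lang is not None:
        base["lang"] = lang
    return {
        "hybrid": {**base, "k_bm25": k, "k_knn": k},
        "bm25_only": {**base, "k_bm25": k, "k_knn": 0},  # vector leg disabled
        "knn_only": {**base, "k_bm25": 0, "k_knn": k},   # keyword leg disabled
    }


def raw_opensearch_bodies(query, query_vector, k=20,
                          text_field="text", vector_field="embedding"):
    """Build raw _search bodies that bypass the AI-Box.

    text_field and vector_field are assumptions; use the names in your mapping.
    """
    return {
        "bm25": {"size": k, "query": {"match": {text_field: query}}},
        "knn": {
            "size": k,
            "query": {
                "knn": {vector_field: {"vector": list(query_vector), "k": k}}
            },
        },
    }


# Print each /retrieve payload so it can be pasted into curl -d '...'.
for name, payload in retrieve_payloads("<the exact user query>", lang="en").items():
    print(name, json.dumps(payload))
```

Comparing the document IDs returned by the bm25_only and knn_only payloads against the hybrid result is a quick way to see which leg contributes the irrelevant hits.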

Incident Response: High Search Latency

Incident Priority: High

Symptom: The aibox_request_duration_seconds metric for the /retrieve or /retrieve_pack endpoints is elevated, or API calls are timing out.

Triage & Recovery Checklist

  1. Check AI-Box Service Logs:

    • Action: Run docker compose logs -f ai-box
    • Look for: Any obvious errors, warnings, or timeouts in the application logs.
  2. Check OpenSearch Cluster Health:

    • Action: Check the OpenSearch cluster's CPU, memory, and query performance via your monitoring dashboards. High search latency in the AI-Box is almost always caused by high search latency in OpenSearch.
    • Verify: curl -s 'http://localhost:9200/_cluster/health?pretty'
  3. Inspect Query Diagnostics:

    • Action: Re-run a slow query and inspect the diagnostics object in the JSON response.
    • Analyze: The bm25_ms and knn_ms fields report the time spent in each leg of the hybrid search, which shows which leg is slow.
  4. Check for Expensive Queries:

    • A very broad query, a query with complex filters, or a very high k value can cause high latency. Review the slow query for any obvious issues.
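Step 3 can be scripted once the diagnostics object has been extracted from a slow response. The sketch below is a hypothetical helper that assumes only the two fields named above (bm25_ms and knn_ms); the sample timings are made up for illustration.

```python
def slowest_leg(diagnostics):
    """Return (leg, milliseconds, share of measured leg time) for the slowest leg.

    Missing fields are treated as 0 ms.
    """
    timings = {
        "bm25": float(diagnostics.get("bm25_ms", 0.0)),
        "knn": float(diagnostics.get("knn_ms", 0.0)),
    }
    total = sum(timings.values())
    leg = max(timings, key=timings.get)
    share = timings[leg] / total if total > 0 else 0.0
    return leg, timings[leg], share


# Illustrative values only; paste the diagnostics object from a real response.
leg, ms, share = slowest_leg({"bm25_ms": 35.0, "knn_ms": 410.0})
print(f"slowest leg: {leg} ({ms:.0f} ms, {share:.0%} of measured leg time)")
```

If both legs are fast but the request is still slow, the time is being spent outside the two searches, for example in ranking or in the network path.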

Known Errors & Quick Fixes

parsing_exception: Unknown key for a START_OBJECT in [knn]

Place the knn clause inside query rather than at the top level of the request body:

curl -s -XPOST localhost:9200/news_docs/_search -H 'content-type: application/json' -d '{
  "size": 2,
  "query": {
    "knn": {
      "embedding": { "vector": [0.1,0.2,0.3,0.4], "k": 2 }
    }
  }
}'

zero vector is not supported when space type is [cosinesimil]

Use a non-zero vector when testing kNN queries; cosine similarity is undefined for the zero vector. To generate a random unit vector quickly:

python - <<'PY'
# Print a random 4-dimensional unit vector as JSON (fixed seed for repeatability).
import json, math, random
random.seed(42)
v = [random.random() for _ in range(4)]
norm = math.sqrt(sum(x * x for x in v))
print(json.dumps([x / norm for x in v]))
PY
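When the test vector comes from somewhere other than a random generator, it can help to validate it before sending the query. The sketch below is a hypothetical helper, not part of the AI-Box; it rejects the zero vector up front, which mirrors the condition OpenSearch reports for cosinesimil.

```python
import math


def to_unit_vector(vector):
    """Scale a vector to unit length, rejecting the zero vector."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("zero vector cannot be used with cosine similarity")
    return [x / norm for x in vector]


print(to_unit_vector([3.0, 4.0]))
```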

pipeline with id [text_embed_news] does not exist

Either remove the pipeline parameter from the indexing call or create a stub pipeline before indexing.
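A stub pipeline can be created through the ingest pipeline API (PUT _ingest/pipeline/<id>). The sketch below builds that request with the Python standard library. It assumes OpenSearch at localhost:9200 and that a pipeline with an empty processors list is an acceptable no-op. Documents indexed through the stub will not receive embeddings, so treat this as a way to unblock testing, not as a replacement for the real pipeline.

```python
import json
import urllib.request


def stub_pipeline_request(pipeline_id="text_embed_news",
                          base_url="http://localhost:9200"):
    """Build a PUT request that creates a no-op ingest pipeline."""
    body = {
        "description": "No-op stub so indexing calls that name this pipeline succeed",
        "processors": [],  # assumption: an empty list is accepted as a no-op
    }
    return urllib.request.Request(
        url=f"{base_url}/_ingest/pipeline/{pipeline_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = stub_pipeline_request()
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req)
```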

strict_dynamic_mapping_exception

The index uses dynamic: "strict". Add fields via PUT _mapping or prefer enriching at read-time through /retrieve_pack hydration instead of _update.
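For the PUT _mapping route, the request body lists only the new fields under properties. The sketch below builds the path and body; the helper name and the field source_region are hypothetical, and news_docs is the index used in the examples above. PUT _mapping can add fields but cannot change the type of a field that already exists.

```python
import json


def add_field_mapping(index, field, field_type):
    """Return the (path, body) pair for adding one field to a strict index."""
    path = f"/{index}/_mapping"
    body = {"properties": {field: {"type": field_type}}}
    return path, body


# "source_region" is a made-up field name for illustration.
path, body = add_field_mapping("news_docs", "source_region", "keyword")
print("PUT", path)
print(json.dumps(body))
```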

502 "Search backend error"

  • Verify OpenSearch health: curl -s localhost:9200/_cluster/health
  • Confirm that the OS_INDEX and VECTOR_FIELD environment variables match the index name and its mapping; list indices with: curl -s localhost:9200/_cat/indices
  • Run a minimal _search to validate the index responds.
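The three checks above can be run in one pass. The sketch below is a hypothetical helper that assumes OpenSearch at localhost:9200 without authentication, and falls back to news_docs and embedding (the names used in this runbook's examples) when OS_INDEX and VECTOR_FIELD are unset. The fetch parameter exists so the logic can be exercised without a live cluster.

```python
import json
import os
import urllib.request


def http_get_json(url):
    """GET a URL and decode the JSON response."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def check_search_backend(base_url="http://localhost:9200", index=None,
                         vector_field=None, fetch=http_get_json):
    """Run the 502 triage checks and return the findings as a dict."""
    index = index or os.environ.get("OS_INDEX", "news_docs")
    vector_field = vector_field or os.environ.get("VECTOR_FIELD", "embedding")
    findings = {}

    # 1. Cluster health: status is green, yellow, or red.
    findings["cluster_status"] = fetch(f"{base_url}/_cluster/health").get("status")

    # 2. Does the configured index exist?
    rows = fetch(f"{base_url}/_cat/indices?format=json")
    findings["index_exists"] = index in [row.get("index") for row in rows]
    if not findings["index_exists"]:
        return findings

    # 3. Is the configured vector field present in the mapping?
    mapping = fetch(f"{base_url}/{index}/_mapping")
    props = mapping.get(index, {}).get("mappings", {}).get("properties", {})
    findings["vector_field_mapped"] = vector_field in props

    # 4. Minimal search to confirm the index responds.
    findings["search_responds"] = "hits" in fetch(f"{base_url}/{index}/_search?size=1")
    return findings
```

Calling check_search_backend() against a running cluster returns a small dict of findings; a missing index or an unmapped vector field usually points to a mismatch between the environment variables and the cluster.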

Metrics quick queries

  • RRF time (avg): rate(aibox_retrieval_rrf_ms_sum[5m]) / rate(aibox_retrieval_rrf_ms_count[5m])
  • Request latency p95: histogram_quantile(0.95, rate(aibox_request_duration_seconds_bucket[5m]))
  • 5xx error rate (requests per second): rate(aibox_requests_total{status=~"5.."}[5m])
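These expressions can also be run outside the dashboard through the Prometheus HTTP API instant-query endpoint (/api/v1/query). The sketch below only builds the URLs; the Prometheus address localhost:9090 is an assumption, and the expressions are copied from the list above.

```python
from urllib.parse import urlencode

QUERIES = {
    "rrf_avg_ms": "rate(aibox_retrieval_rrf_ms_sum[5m]) / rate(aibox_retrieval_rrf_ms_count[5m])",
    "latency_p95": "histogram_quantile(0.95, rate(aibox_request_duration_seconds_bucket[5m]))",
    "error_rate": 'rate(aibox_requests_total{status=~"5.."}[5m])',
}


def instant_query_url(expr, prometheus_url="http://localhost:9090"):
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    return f"{prometheus_url}/api/v1/query?{urlencode({'query': expr})}"


for name, expr in QUERIES.items():
    print(name, instant_query_url(expr))
```

Each printed URL can be fetched with curl -s to get the current value as JSON.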