---
title: "Runbook: Model Cache Cold"
description: First-request slowness due to cold model/tokenizer caches.
icon: material/snowflake
---


# Model Cache Cold

## Symptoms

- The first request after a deploy/restart is significantly slower than subsequent requests.
- `/metrics` shows a spike in request duration immediately post-deploy.

## Likely causes

- Lazy loading of models/tokenizers on first use (e.g., S1 or the reranker, if enabled).
- An empty on-disk cache after a fresh container start.
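To confirm that lazy loading is the culprit, time the same request twice right after a restart: the first call should pay the model-load cost and the second should not. A minimal sketch, assuming the default local port and the `/retrieve` payload used in the warmup commands below (millisecond timing via GNU `date`):

```shell
AIBOX_URL="${AIBOX_URL:-http://localhost:8001}"
BODY='{"query":"warmup","lang":"en","k_bm25":10,"k_knn":0,"k_rrf":60,"rerank":false}'

# time_ms CMD... -> prints elapsed wall-clock milliseconds for one invocation
# (relies on GNU date's %N nanosecond field)
time_ms() {
  local start end
  start=$(date +%s%N)
  "$@" > /dev/null 2>&1
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

# First call pays the lazy-load cost; the second should be much faster.
cold=$(time_ms curl -s "$AIBOX_URL/retrieve" -H 'content-type: application/json' -d "$BODY")
warm=$(time_ms curl -s "$AIBOX_URL/retrieve" -H 'content-type: application/json' -d "$BODY")
echo "cold=${cold}ms warm=${warm}ms"
```

If `cold` is dominated by model loading you should see an order-of-magnitude gap between the two numbers.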

## Mitigation (warm the cache)

Warm representative code paths once after deployment:

```bash
# Warm retrieval (BM25)
curl -s ${AIBOX_URL:-http://localhost:8001}/retrieve \
  -H 'content-type: application/json' \
  -d '{"query":"warmup","lang":"en","k_bm25":10,"k_knn":0,"k_rrf":60,"rerank":false}' > /dev/null

# Warm Evidence Pack (no hydration required)
curl -s ${AIBOX_URL:-http://localhost:8001}/retrieve_pack \
  -H 'content-type: application/json' \
  -d '{"query":"warmup","lang":"en","k_bm25":10,"k_knn":0,"k_rrf":60,"rerank":false}' > /dev/null

# (If AIB-15 enabled) Warm S1 check-worthiness
curl -s ${AIBOX_URL:-http://localhost:8001}/s1/score \
  -H 'content-type: application/json' \
  -d '{"text":"warmup","lang":"en"}' > /dev/null
```

### Automate the warmup

Add these commands to your deployment's post-start hook, or to a CI job that validates readiness after rollout.
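A sketch of such a hook, assuming the endpoints above; the retry count and delay are illustrative defaults, not project settings:

```shell
AIBOX_URL="${AIBOX_URL:-http://localhost:8001}"
BODY='{"query":"warmup","lang":"en","k_bm25":10,"k_knn":0,"k_rrf":60,"rerank":false}'

# warm PATH JSON -> POST until the endpoint answers 2xx, or give up
warm() {
  local path=$1 body=$2 attempt
  for attempt in 1 2 3; do
    if curl -sf -o /dev/null "$AIBOX_URL$path" \
         -H 'content-type: application/json' -d "$body"; then
      echo "warmed $path (attempt $attempt)"
      return 0
    fi
    sleep 1
  done
  echo "warmup failed for $path" >&2
  return 1
}

warm /retrieve "$BODY"      || failed=1
warm /retrieve_pack "$BODY" || failed=1
# warm /s1/score '{"text":"warmup","lang":"en"}' || failed=1   # only if AIB-15 is enabled

if [ -z "${failed:-}" ]; then echo "warmup complete"; else echo "warmup incomplete" >&2; fi
```

The retry loop doubles as a readiness gate: the hook only reports `warmup complete` once every warmed endpoint has answered successfully.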

## Verification

- Compare the `aibox_request_duration_seconds` p95 from `/metrics` before and after warmup.
- A manual smoke test run after warmup should feel fast from its first request.
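The p95 comparison needs a PromQL engine; for a quick smoke check you can instead derive the mean request duration from the histogram's raw `_sum`/`_count` series on `/metrics`. A sketch, assuming the series are exposed unlabeled (labeled series would need aggregating first):

```shell
AIBOX_URL="${AIBOX_URL:-http://localhost:8001}"
# Prometheus exposes histogram totals as <name>_sum and <name>_count;
# mean = sum / count. Run once before and once after warmup and compare.
curl -s "$AIBOX_URL/metrics" | awk '
  /^aibox_request_duration_seconds_sum/   { sum = $2 }
  /^aibox_request_duration_seconds_count/ { cnt = $2 }
  END { if (cnt > 0) printf "mean=%.3fs over %d requests\n", sum / cnt, cnt }'
```

The mean is cruder than p95, but a post-warmup drop in it is usually enough to confirm the caches are no longer cold.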