---
title: "Runbook: Model Cache Cold"
description: First-request slowness due to cold model/tokenizer caches.
icon: material/snowflake
---


# Model Cache Cold

## Symptoms

- The first request after a deploy/restart is significantly slower than subsequent requests.
- `/metrics` shows a spike in request duration immediately post-deploy.

## Likely causes

- Lazy loading of models/tokenizers on first use (e.g., S1 or the reranker, if enabled).
- An empty on-disk cache after a fresh container start.
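To confirm that lazy loading is the culprit, time the same request twice right after a restart: the first call should pay the model-load cost and the second should not. A minimal sketch, assuming the default local port and the `/retrieve` payload used in the warmup commands below (millisecond timing via GNU `date`):

```shell
AIBOX_URL="${AIBOX_URL:-http://localhost:8001}"
BODY='{"query":"warmup","lang":"en","k_bm25":10,"k_knn":0,"k_rrf":60,"rerank":false}'

# time_ms CMD... -> prints elapsed wall-clock milliseconds for one invocation
# (relies on GNU date's %N nanosecond field)
time_ms() {
  local start end
  start=$(date +%s%N)
  "$@" > /dev/null 2>&1
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

# First call pays the lazy-load cost; the second should be much faster.
cold=$(time_ms curl -s "$AIBOX_URL/retrieve" -H 'content-type: application/json' -d "$BODY")
warm=$(time_ms curl -s "$AIBOX_URL/retrieve" -H 'content-type: application/json' -d "$BODY")
echo "cold=${cold}ms warm=${warm}ms"
```

If `cold` is dominated by model loading you should see an order-of-magnitude gap between the two numbers.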

## Mitigation (warm the cache)

Warm representative code paths once after deployment:

```bash
# Warm retrieval (BM25)
curl -s ${AIBOX_URL:-http://localhost:8001}/retrieve \
  -H 'content-type: application/json' \
  -d '{"query":"warmup","lang":"en","k_bm25":10,"k_knn":0,"k_rrf":60,"rerank":false}' > /dev/null

# Warm Evidence Pack (no hydration required)
curl -s ${AIBOX_URL:-http://localhost:8001}/retrieve_pack \
  -H 'content-type: application/json' \
  -d '{"query":"warmup","lang":"en","k_bm25":10,"k_knn":0,"k_rrf":60,"rerank":false}' > /dev/null

# (If AIB-15 enabled) Warm S1 check-worthiness
curl -s ${AIBOX_URL:-http://localhost:8001}/s1/score \
  -H 'content-type: application/json' \
  -d '{"text":"warmup","lang":"en"}' > /dev/null
```

### Automate the warmup

Add these commands to your deployment's post-start hook, or to a CI job that validates readiness after rollout.
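A sketch of such a hook, assuming the endpoints above; the retry count and delay are illustrative defaults, not project settings:

```shell
AIBOX_URL="${AIBOX_URL:-http://localhost:8001}"
BODY='{"query":"warmup","lang":"en","k_bm25":10,"k_knn":0,"k_rrf":60,"rerank":false}'

# warm PATH JSON -> POST until the endpoint answers 2xx, or give up
warm() {
  local path=$1 body=$2 attempt
  for attempt in 1 2 3; do
    if curl -sf -o /dev/null "$AIBOX_URL$path" \
         -H 'content-type: application/json' -d "$body"; then
      echo "warmed $path (attempt $attempt)"
      return 0
    fi
    sleep 1
  done
  echo "warmup failed for $path" >&2
  return 1
}

warm /retrieve "$BODY"      || failed=1
warm /retrieve_pack "$BODY" || failed=1
# warm /s1/score '{"text":"warmup","lang":"en"}' || failed=1   # only if AIB-15 is enabled

if [ -z "${failed:-}" ]; then echo "warmup complete"; else echo "warmup incomplete" >&2; fi
```

The retry loop doubles as a readiness gate: the hook only reports `warmup complete` once every warmed endpoint has answered successfully.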

## Verification

- Compare the `aibox_request_duration_seconds` p95 from `/metrics` before and after warmup.
- A manual smoke test run after warmup should feel fast from its first request.
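The p95 comparison needs a PromQL engine; for a quick smoke check you can instead derive the mean request duration from the histogram's raw `_sum`/`_count` series on `/metrics`. A sketch, assuming the series are exposed unlabeled (labeled series would need aggregating first):

```shell
AIBOX_URL="${AIBOX_URL:-http://localhost:8001}"
# Prometheus exposes histogram totals as <name>_sum and <name>_count;
# mean = sum / count. Run once before and once after warmup and compare.
curl -s "$AIBOX_URL/metrics" | awk '
  /^aibox_request_duration_seconds_sum/   { sum = $2 }
  /^aibox_request_duration_seconds_count/ { cnt = $2 }
  END { if (cnt > 0) printf "mean=%.3fs over %d requests\n", sum / cnt, cnt }'
```

The mean is cruder than p95, but a post-warmup drop in it is usually enough to confirm the caches are no longer cold.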