---
title: "Runbook: Model Cache Cold"
description: First-request slowness due to cold model/tokenizer caches.
icon: material/snowflake
---
# Model Cache Cold

## Symptoms
- First request after deploy/restart is significantly slower than subsequent requests.
- `/metrics` shows a spike in request duration immediately post-deploy.
## Likely causes
- Lazy loading of models/tokenizers on first use (e.g., S1 or reranker if enabled).
- Empty on-disk cache after fresh container start.
## Mitigation (warm the cache)
Warm representative code paths once after deployment:
```bash
# Warm retrieval (BM25)
curl -s ${AIBOX_URL:-http://localhost:8001}/retrieve -H 'content-type: application/json' -d '{"query":"warmup","lang":"en","k_bm25":10,"k_knn":0,"k_rrf":60,"rerank":false}' > /dev/null

# Warm Evidence Pack (no hydration required)
curl -s ${AIBOX_URL:-http://localhost:8001}/retrieve_pack -H 'content-type: application/json' -d '{"query":"warmup","lang":"en","k_bm25":10,"k_knn":0,"k_rrf":60,"rerank":false}' > /dev/null

# (If AIB-15 enabled) Warm S1 Check-worthiness
curl -s ${AIBOX_URL:-http://localhost:8001}/s1/score -H 'content-type: application/json' -d '{"text":"warmup","lang":"en"}' > /dev/null
```
!!! tip "Automate the warmup"
    Add these commands to your deployment post-start hook or to a CI job that validates readiness.
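A post-start hook could wrap the curl calls above in a small script. This is a minimal sketch: the endpoints, request bodies, and the `AIBOX_URL` default come from the commands in this runbook, while the script name and the `warm` helper are our own. Warmup is best-effort, so a failed request must not fail the deploy hook:

```bash
#!/usr/bin/env sh
# warmup.sh -- fire one request per lazily loaded code path after deploy.
set -u
BASE="${AIBOX_URL:-http://localhost:8001}"
BODY='{"query":"warmup","lang":"en","k_bm25":10,"k_knn":0,"k_rrf":60,"rerank":false}'

warm() {
  # Best-effort: never let a warmup failure break the post-start hook.
  curl -s --max-time 5 "$BASE$1" -H 'content-type: application/json' -d "$2" > /dev/null || true
}

warm /retrieve "$BODY"
warm /retrieve_pack "$BODY"
# Uncomment if AIB-15 (S1 check-worthiness) is enabled:
# warm /s1/score '{"text":"warmup","lang":"en"}'
echo "warmup complete"
```

Because `warm` always exits 0, the hook succeeds even when an optional endpoint is disabled or the service is briefly unreachable.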
## Verification
- Compare `aibox_request_duration_seconds` p95 from `/metrics` before and after the warmup.
- A manual smoke test should feel instant after the first request.
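For the p95 comparison, a small helper can pull the value out of the Prometheus text format. This sketch assumes `aibox_request_duration_seconds` is exported as a summary with a `quantile="0.95"` label; if it is a histogram instead, compute p95 with `histogram_quantile` in Prometheus rather than grepping:

```bash
# extract_p95 reads Prometheus text-format metrics on stdin and prints
# the 0.95-quantile value of aibox_request_duration_seconds.
extract_p95() {
  grep 'aibox_request_duration_seconds{' \
    | grep 'quantile="0.95"' \
    | awk '{print $2}'
}

# Usage against a live service (run once before and once after warmup):
# curl -s "${AIBOX_URL:-http://localhost:8001}/metrics" | extract_p95
```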