title: Runbook: ML Model Loading Failure
description: Diagnose and resolve failures when a model fails to load in AI-Box.
icon: material/cpu-64-bit
Model Loading Failure
Impact: Critical — service start failure
A failure to load an optional model (e.g., S1 when S1_BACKEND=hf, or a reranker) can prevent the service from starting or cause a crash on first use.
Triage (≤5 minutes)
- Inspect container logs
  Look for Python tracebacks mentioning model IDs/paths.
- Identify the failing component
  - S1 (AIB-15): controlled by ENABLE_AIB_15, S1_BACKEND, S1_MODEL_ID
  - Reranker: controlled by ENABLE_RERANK, RERANK_MODEL_ID
- Check env/config
  - Confirm paths/IDs and that heavy deps exist only if needed.
  - Default image may not include transformers/torch; using S1_BACKEND=hf without them will fail by design.
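The env/config check above can be sketched as a small preflight script run inside the container. This is a minimal sketch, not AI-Box code: the function name preflight is hypothetical, and the way boolean flags are parsed ("1", "true", "yes", "on") is an assumption, since the service's own parsing rules are not documented here. Only the variable names come from this runbook.

```python
import importlib.util
import os

# Assumption: these strings count as "enabled"; the service may parse flags differently.
TRUTHY = {"1", "true", "yes", "on"}


def _flag(env, name):
    """Interpret an env var as a boolean; an unset variable counts as disabled."""
    return env.get(name, "").strip().lower() in TRUTHY


def _module_available(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def preflight(env=None, has_module=_module_available):
    """Return a list of human-readable problems in the model configuration."""
    env = os.environ if env is None else env
    problems = []

    # S1 (AIB-15): the hf backend needs a model ID and the heavy dependencies.
    if _flag(env, "ENABLE_AIB_15") and env.get("S1_BACKEND") == "hf":
        if not env.get("S1_MODEL_ID"):
            problems.append("S1_BACKEND=hf but S1_MODEL_ID is empty")
        for module in ("transformers", "torch"):
            if not has_module(module):
                problems.append(f"S1_BACKEND=hf but '{module}' is not installed")

    # Reranker: only the model ID is checked; its dependencies are not stated in this runbook.
    if _flag(env, "ENABLE_RERANK") and not env.get("RERANK_MODEL_ID"):
        problems.append("ENABLE_RERANK is set but RERANK_MODEL_ID is empty")

    return problems
```

Calling preflight() with no arguments reads the live environment; an empty list means none of these specific misconfigurations were found, not that the model will load.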
Remediation
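- Verify that S1_MODEL_ID or RERANK_MODEL_ID is correct.
- If mounting local models, confirm the volume is mounted and the model files are visible inside the container.
- Rebuild the image and restart the container.
- Remove the local HF cache inside the container and restart.
- Increase container memory limits or pick a smaller/quantized model.
- Temporarily disable the failing component:
  - S1: ENABLE_AIB_15=false
  - Reranker: ENABLE_RERANK=false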
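Two of the steps above, checking a mounted model directory and clearing the HF cache, can be sketched in Python for use inside the container. This is a sketch under assumptions: the helper names are hypothetical, the cache location follows the Hugging Face Hub environment variables (HF_HUB_CACHE, then HF_HOME, then XDG_CACHE_HOME, then ~/.cache/huggingface), and the config.json check assumes a model saved in the usual transformers layout. If AI-Box configures a different cache path, adjust accordingly.

```python
import os
import shutil
from pathlib import Path


def hf_hub_cache_dir(env=None):
    """Resolve the Hugging Face hub cache directory from the environment."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    cache_root = env.get("XDG_CACHE_HOME") or "~/.cache"
    return Path(cache_root).expanduser() / "huggingface" / "hub"


def local_model_problems(model_dir):
    """Return obvious problems with a mounted local model directory."""
    path = Path(model_dir)
    if not path.is_dir():
        return [f"{path} is not a directory (is the volume mounted?)"]
    # Assumption: the model uses the standard transformers layout with a config.json.
    if not (path / "config.json").is_file():
        return [f"{path} has no config.json"]
    return []


def clear_hub_cache(env=None):
    """Delete the hub cache directory; return True if something was removed."""
    cache = hf_hub_cache_dir(env)
    if cache.is_dir():
        shutil.rmtree(cache)
        return True
    return False
```

Clearing the cache forces a fresh download on the next start, so it only helps when the cached files are corrupt or partial and the container has network access to fetch the model again.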
Post-incident
- Add model health to /health (lazy probe w/ cache).
- Document RAM/CPU needs per model in Requirements.
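One way to read "lazy probe w/ cache" is a health check that runs only when /health is requested and reuses its last result for a short time. The following is a minimal, framework-agnostic sketch: the class name CachedModelHealth, the 30-second default TTL, and the shape of the returned dict are all assumptions, and the probe callable stands in for whatever lightweight model check the service would actually use.

```python
import threading
import time


class CachedModelHealth:
    """Run a model probe lazily and cache its result for ttl_seconds."""

    def __init__(self, probe, ttl_seconds=30.0, clock=time.monotonic):
        self._probe = probe          # callable that raises if the model is unhealthy
        self._ttl = ttl_seconds
        self._clock = clock          # injectable for testing
        self._lock = threading.Lock()
        self._checked_at = None
        self._result = None

    def status(self):
        """Return the cached result, re-running the probe only when it is stale."""
        with self._lock:
            now = self._clock()
            stale = self._checked_at is None or now - self._checked_at >= self._ttl
            if stale:
                try:
                    self._probe()
                    self._result = {"ok": True, "error": None}
                except Exception as exc:
                    # Failures are cached too, so a broken model is not re-probed on every request.
                    self._result = {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
                self._checked_at = now
            return self._result
```

Nothing runs at construction time, so adding this to /health does not slow startup; the first request pays the probe cost and later requests within the TTL return immediately.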