Author: Jakub Rusinowski · Last updated: 15 June 2026
Founder, LLM Configurator — AI educator & workshop leader on local LLM deployment
llama_new_context_with_model: failed to allocate KV cache
The model loads without complaint and answers short prompts — then falls over once you feed it a long document or a long chat history. The crash comes at *generation* time on big inputs, not at load time, which is what makes it confusing: the weights clearly fit, yet it still runs out of memory.
Two different things draw on your VRAM: the model weights (a fixed cost) and the KV cache (which grows with context length). The KV cache stores the attention keys and values for every token in the window, at every layer, so memory use climbs roughly linearly with how many tokens you're holding. A model that fits comfortably at 4K context can need gigabytes more at 32K or 128K, and that extra is what tips a card that was previously "just fitting" into OOM. This is the classic "Tight" fit: fine on paper, no headroom for context.
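To make the linear growth concrete, here is a rough back-of-the-envelope estimate in Python. It assumes a standard attention layout in which every layer caches one key and one value vector per token; the model shape used (32 layers, 8 KV heads, head dimension 128) is a hypothetical example, not a value read from any particular model file, and real runtimes add some overhead on top.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_element=2.0):
    """Estimate KV cache size: one K and one V vector per token, per layer."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return n_tokens * per_token


# Hypothetical model shape (illustrative only, not read from a real GGUF file).
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)
GIB = 1024 ** 3

# Default bytes_per_element=2.0 corresponds to an f16 cache.
for n_ctx in (4096, 32768, 131072):
    size_gib = kv_cache_bytes(n_ctx, **SHAPE) / GIB
    print(f"{n_ctx:>7} tokens -> {size_gib:.1f} GiB")
```

With these assumed dimensions the f16 cache grows from 0.5 GiB at 4K tokens to 4 GiB at 32K and 16 GiB at 128K, while the weights stay the same size.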
The mistake is budgeting only for the weights. You need room for the weights plus the KV cache at the context length you actually use, so pick a model and quant that leave headroom on your card for that context. A "Tight" fit with no room for KV cache is exactly what OOMs on long inputs. Check what fits *with* your target context, not just the bare weights.
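The budgeting rule can be sketched as a small check. This is a minimal illustration, not a measurement tool: the function name, the 1 GiB overhead allowance (for compute buffers, the desktop, and so on), and the example sizes are all assumptions chosen for the example.

```python
def fits_in_vram(vram_gib, weights_gib, kv_cache_gib, overhead_gib=1.0):
    """Return (fits, headroom_gib) for weights + KV cache + a fixed overhead.

    overhead_gib is an assumed allowance, not a measured value.
    """
    needed = weights_gib + kv_cache_gib + overhead_gib
    return needed <= vram_gib, vram_gib - needed


# Hypothetical 12 GiB card with 8.5 GiB of weights.
short_ctx = fits_in_vram(12.0, 8.5, kv_cache_gib=0.5)  # e.g. a 4K window
long_ctx = fits_in_vram(12.0, 8.5, kv_cache_gib=4.0)   # e.g. a 32K window
print("short context:", short_ctx)
print("long context: ", long_ctx)
```

In this made-up case the same weights fit with 2 GiB to spare at the short context and fall 1.5 GiB short at the long one, which is the "loads fine, dies on long inputs" pattern described above.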
If you don't need a 128K window, don't allocate one. Setting the context to the size of your actual inputs shrinks the KV cache directly. This is the single most effective knob.
# llama.cpp: cap the context
./llama-cli -m model.gguf -c 8192
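# Ollama: the server reads this variable, so set it where the server starts
OLLAMA_CONTEXT_LENGTH=8192 ollama serve
# then, in another terminal: ollama run <model>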
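llama.cpp can store the KV cache at lower precision (e.g. q8_0 instead of f16), roughly halving its memory for a small quality cost. This is useful when you genuinely need a long context on a tight card. Depending on your llama.cpp version, quantizing the V cache may also require flash attention to be enabled; check the output of --help for your build.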
./llama-cli -m model.gguf -c 32768 \
--cache-type-k q8_0 --cache-type-v q8_0
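The "roughly halving" figure follows from the storage formats. As a sketch: f16 uses 2 bytes per value, while q8_0 packs 32 one-byte values plus a 2-byte scale into a 34-byte block, or about 1.06 bytes per value. The model shape below is again a hypothetical example, and the helper name is made up for illustration.

```python
# Bytes per cached element. q8_0 stores 32 int8 values plus one 2-byte scale
# in a 34-byte block; f16 is a plain 2 bytes per value.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_gib(n_tokens, n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Estimated KV cache size in GiB for a given cache storage type."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_tokens  # K and V
    return elements * BYTES_PER_ELEMENT[cache_type] / 1024 ** 3


# Hypothetical model shape, 32K context.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
f16 = kv_cache_gib(32768, **shape, cache_type="f16")
q8 = kv_cache_gib(32768, **shape, cache_type="q8_0")
print(f"f16: {f16:.2f} GiB, q8_0: {q8:.2f} GiB, ratio {q8 / f16:.2f}")
```

For this assumed shape the cache drops from 4 GiB to a little over 2 GiB, which is often the difference between fitting and not fitting on a tight card.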
If long context is non-negotiable, a smaller or more aggressively quantized model frees the VRAM the KV cache needs. Trading a little weight precision for a lot of usable context is often the right call for document work.
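The trade-off can be sketched by turning the budget around and asking how many tokens fit once the weights are loaded. The card size, the two weight sizes, the 1 GiB overhead, and the per-token cache cost below are all made-up illustrative numbers, and the sketch assumes both quants belong to the same architecture so the per-token cost is unchanged.

```python
def max_context_tokens(vram_gib, weights_gib, bytes_per_token, overhead_gib=1.0):
    """Largest context whose KV cache fits in the VRAM left after the weights."""
    free_bytes = (vram_gib - weights_gib - overhead_gib) * 1024 ** 3
    return max(0, int(free_bytes // bytes_per_token))


# Hypothetical numbers: 128 KiB of f16 KV cache per token, a 16 GiB card,
# and two invented weight sizes for a larger and a smaller quant of one model.
BYTES_PER_TOKEN = 128 * 1024
for label, weights_gib in (("larger quant", 13.0), ("smaller quant", 9.0)):
    tokens = max_context_tokens(16.0, weights_gib, BYTES_PER_TOKEN)
    print(f"{label}: {weights_gib} GiB weights -> about {tokens} tokens")
```

Under these assumptions, freeing 4 GiB of weights triples the usable window, from 16384 to 49152 tokens.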
Because the KV cache grows with context length. Short prompts use little of it; long documents or chats need much more, and that extra memory is what pushes a tightly-fitting model over its VRAM limit.
It depends on the model, but it scales roughly linearly with token count and can reach several gigabytes at very long contexts. That is why you should budget for weights plus context together, not weights alone.
Q8 KV cache is close to lossless in practice and roughly halves cache memory. More aggressive cache quantization can affect long-context coherence, so q8_0 is the usual sweet spot.