Out of memory at long context (the KV cache, not the weights)

Name: LLM Configurator — GPU VRAM Checker
Author: LLM Configurator

Author: Jakub Rusinowski · Last updated: June 15, 2026

Founder, LLM Configurator — AI educator & workshop leader on local LLM deployment

The error

llama_new_context_with_model: failed to allocate KV cache

When you see it

The model loads without complaint and answers short prompts, then falls over once you give it a long document or a long chat history, or raise the context size to make room for one. The failure is in allocating the KV cache for that larger context, not in loading the weights, which is what makes it confusing: the weights clearly fit, yet it still runs out of memory.

What's actually going on

Two different things draw on your VRAM: the model weights (fixed) and the KV cache (grows with context length). The KV cache stores keys/values for every token in the window, so memory use climbs roughly linearly with how many tokens you're holding. A model that fits comfortably at 4K context can need gigabytes more at 32K or 128K — and that extra is what tips a card that was previously "just fitting" into OOM. This is the classic Tight fit: fine on paper, no headroom for context.
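
As a rough sketch of that linear growth, the cache can be estimated as 2 (keys and values) × layers × KV heads × head dimension × context tokens × bytes per element. The function below is a hypothetical helper, not a llama.cpp command, and the architecture in the loop (32 layers, 8 KV heads, head dimension 128, f16 cache) is an assumed example for an 8B-class model with grouped-query attention. Substitute the values from your own model's metadata.

```shell
# Hypothetical helper (not part of llama.cpp): estimate KV cache size in bytes.
# Args: n_layers n_kv_heads head_dim n_ctx bytes_per_element
kv_cache_bytes() {
  # 2 = one tensor for keys plus one for values, per layer
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 ))
}

# Assumed example architecture, f16 cache (2 bytes per element)
for ctx in 4096 32768 131072; do
  echo "$ctx tokens: $(( $(kv_cache_bytes 32 8 128 "$ctx" 2) / 1048576 )) MiB"
done
```

Under these assumptions the cache grows from 512 MiB at 4K to 4 GiB at 32K and 16 GiB at 128K while the weights stay the same size. Treat the result as a lower bound, since runtimes also allocate compute buffers on top of the cache.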

How to fix it

1. Leave VRAM headroom: size weights AND context together (most common fix)

The mistake is budgeting only for the weights. You need room for the weights plus the KV cache at the context length you actually use. The reliable way to avoid this is to pick a model/quant that leaves headroom on your card for the context you need — a "Tight" fit with no room for KV cache is exactly what OOMs on long inputs. Check what fits *with* your target context, not just the bare weights.
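
A minimal sketch of that budget check, assuming you already know the three sizes in MiB. The helper name, the 1 GiB overhead allowance (compute buffers, driver context, anything else on the card), and the example figures are all illustrative assumptions, not measurements.

```shell
# Hypothetical helper: does weights + KV cache + overhead fit in VRAM?
# Args (all in MiB): vram_total weights kv_cache overhead
fits_in_vram() {
  needed=$(( $2 + $3 + $4 ))
  if [ "$needed" -le "$1" ]; then
    echo "fits: need $needed MiB of $1 MiB, $(( $1 - needed )) MiB spare"
  else
    echo "too big: need $needed MiB of $1 MiB, $(( needed - $1 )) MiB over"
    return 1
  fi
}

# Assumed example: 8 GiB card, 4800 MiB of weights, 1024 MiB overhead allowance
fits_in_vram 8192 4800 512 1024                                   # small cache
fits_in_vram 8192 4800 4096 1024 || echo "pick a smaller quant or context"
```

The same weights pass with a 512 MiB cache and fail with a 4 GiB one, which is the "Tight" case in numbers. On NVIDIA cards, `nvidia-smi --query-gpu=memory.total,memory.used --format=csv,noheader,nounits` reports total and currently used memory in MiB for the first argument.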

2. Cap the context length to what you need

If you don't need a 128K window, don't allocate one. Setting the context to the size of your actual inputs shrinks the KV cache directly. This is the single most effective knob.

# llama.cpp: cap the context
./llama-cli -m model.gguf -c 8192

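# Ollama: the variable is read by the server, so set it where the server starts
OLLAMA_CONTEXT_LENGTH=8192 ollama serve
# or per session, inside `ollama run <model>`:
/set parameter num_ctx 8192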
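
To choose the cap from the memory you have rather than by guesswork, the estimate can be inverted: divide the VRAM left for the cache by the per-token cache cost. This is a sketch with a hypothetical helper name and an assumed example architecture (32 layers, 8 KV heads, head dimension 128, f16 cache). It ignores compute buffers, so round the result down.

```shell
# Hypothetical helper: largest context (in tokens) whose KV cache fits a budget.
# Args: budget_mib n_layers n_kv_heads head_dim bytes_per_element
max_ctx_tokens() {
  per_token=$(( 2 * $2 * $3 * $4 * $5 ))   # bytes of keys + values per token
  echo $(( $1 * 1048576 / per_token ))
}

# Assumed example: 2 GiB left for the cache after the weights are loaded
max_ctx_tokens 2048 32 8 128 2   # prints 16384
```

With these numbers, 16384 would be the ceiling for `-c`; anything at or below it keeps the cache inside the 2 GiB budget.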

3. Quantize the KV cache

llama.cpp can store the KV cache at lower precision (e.g. q8_0 instead of f16), roughly halving its memory for a small quality cost. Useful when you genuinely need a long context on a tight card.

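# Quantizing the V cache has required flash attention in llama.cpp; if your
# build does not enable it automatically, turn it on (see --flash-attn in --help).
./llama-cli -m model.gguf -c 32768 \
  --cache-type-k q8_0 --cache-type-v q8_0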
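
The "roughly halving" figure follows from the block layouts: ggml stores q8_0 as 32 one-byte values plus a two-byte scale (34 bytes per 32 elements), against 64 bytes for f16. The sketch below compares cache types at a 32K context; the helper name and the architecture (32 layers, 8 KV heads, head dimension 128) are assumptions for illustration.

```shell
# Hypothetical helper: KV cache size in MiB for a given ggml cache type.
# Args: cache_type n_layers n_kv_heads head_dim n_ctx
kv_cache_mib() {
  case "$1" in
    f16)  block_bytes=64 ;;   # 32 values x 2 bytes
    q8_0) block_bytes=34 ;;   # 32 int8 values + one f16 scale
    q4_0) block_bytes=18 ;;   # 32 4-bit values + one f16 scale
    *) echo "unknown cache type: $1" >&2; return 1 ;;
  esac
  elements=$(( 2 * $2 * $3 * $4 * $5 ))   # key + value elements
  echo $(( elements / 32 * block_bytes / 1048576 ))
}

# Assumed example architecture at 32K context
for t in f16 q8_0 q4_0; do
  echo "$t: $(kv_cache_mib "$t" 32 8 128 32768) MiB"
done
```

For this example the cache drops from 4096 MiB at f16 to 2176 MiB at q8_0 and 1152 MiB at q4_0, so q8_0 saves a little under half.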

4. Use a smaller-weight model to free room for context

If long context is non-negotiable, a smaller or more aggressively quantized model frees the VRAM the KV cache needs. Trading a little weight precision for a lot of usable context is often the right call for document work.

Frequently asked questions

Why does my model OOM only on long prompts?

Because the KV cache grows with context length. Short prompts use little of it; long documents or chats need much more, and that extra memory is what pushes a tightly-fitting model over its VRAM limit.

How much VRAM does context use?

It depends on the model's layer count, number of KV heads, and head dimension, and on the cache precision. For a given model it scales roughly linearly with token count and can reach several gigabytes at very long contexts. That is why you should budget for weights plus context together, not weights alone.

Does quantizing the KV cache hurt quality?

A q8_0 KV cache is close to lossless in practice and roughly halves cache memory compared with f16. More aggressive cache quantization can affect long-context coherence, so q8_0 is the usual sweet spot.