
CUDA out of memory — why it happens and how to fix it

Written by Jakub Rusinowski · Last updated June 15, 2026

Founder, LLM Configurator — AI educator & workshop leader on local LLM deployment

The error

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 8.00 GiB total capacity; 7.21 GiB already allocated)

When you see it

The model starts loading, the GPU fans spin up, and then it dies — usually the moment the weights or the first batch hit the card. You see it in PyTorch, in text-generation-webui, in vLLM, and under the hood of a lot of "it just crashed" reports.

What's actually going on

There is no mystery here: the model needs more GPU memory than your card has free. In the message above, the 2.00 GiB after "Tried to allocate" plus the 7.21 GiB "already allocated" comes to 9.21 GiB, which is more than the 8.00 GiB "total capacity". That last figure is your real VRAM ceiling, and on most consumer cards it is 8, 12, 16, or 24 GB. Model weights, the KV cache for your context, PyTorch's own overhead, and anything else running on the GPU all draw from that same pool.
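The arithmetic can be read straight off the message. The sketch below assumes the exact wording quoted in this guide, with all figures in GiB; newer PyTorch releases phrase the error differently, so the pattern and the helper name explain_oom are illustrative only.

```python
import re

# Matches the message format quoted in this guide (figures in GiB).
# Other PyTorch versions word the error differently, so treat this
# pattern as an assumption tied to that exact format.
OOM_PATTERN = re.compile(
    r"Tried to allocate (?P<request>[\d.]+) GiB "
    r"\(GPU \d+; (?P<total>[\d.]+) GiB total capacity; "
    r"(?P<allocated>[\d.]+) GiB already allocated"
)


def explain_oom(message: str) -> dict:
    """Extract the three figures and compute how far over budget the request was."""
    match = OOM_PATTERN.search(message)
    if match is None:
        raise ValueError("message does not match the expected OOM format")
    request = float(match["request"])
    total = float(match["total"])
    allocated = float(match["allocated"])
    return {
        "request_gib": request,
        "total_gib": total,
        "allocated_gib": allocated,
        # Positive shortfall: the request cannot fit even on an otherwise idle card.
        "shortfall_gib": round(request + allocated - total, 2),
    }


message = (
    "torch.cuda.OutOfMemoryError: CUDA out of memory. "
    "Tried to allocate 2.00 GiB (GPU 0; 8.00 GiB total capacity; "
    "7.21 GiB already allocated)"
)
print(explain_oom(message))
```

For the message shown in this guide the shortfall comes out at 1.21 GiB, which is roughly how much has to be freed or saved before the same load can succeed.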

How to fix it

1. Use a smaller model or a lower quantization (the real fix)

Nine times out of ten the honest answer is that the model is simply too big for the card. A 7B model in full FP16 wants roughly 14 GB just for weights; the same model at Q4_K_M wants closer to 4–5 GB and is nearly indistinguishable in quality for most work. Dropping a quant level, or stepping down a model size, fixes this permanently instead of papering over it. The fastest way to know exactly what fits is to check your card against the model library rather than guessing.
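The weight figures above follow from parameters times bits per weight. Here is a rough sketch of that arithmetic; the 4.8 bits per weight used for Q4_K_M is an approximate average assumed for illustration (the real figure varies a little by model), and the estimate covers weights only, not the KV cache or runtime overhead.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes): parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


# FP16 stores every weight in 16 bits.
fp16_gb = weight_gb(7, 16)

# Q4_K_M mixes block formats; ~4.8 bits per weight is an assumed average.
q4_k_m_gb = weight_gb(7, 4.8)

print(f"7B at FP16:   {fp16_gb:.1f} GB")
print(f"7B at Q4_K_M: {q4_k_m_gb:.1f} GB")
```

Leave a margin of a gigabyte or two on top of the weight figure before deciding that a model fits.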


2. Free whatever is already holding VRAM

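Look at the "already allocated" figure. It counts what the current PyTorch process is holding, so if it is high before you have even loaded your model, earlier tensors are still alive in that process, typically in a long-running Jupyter kernel. Memory taken by anything else (a previous run that did not exit cleanly, a browser using the GPU, a second process) shows up in nvidia-smi instead. Check what is resident and kill the stragglers.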

nvidia-smi
# find the PID holding memory, then:
kill <PID>
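If the default nvidia-smi table is hard to scan, its CSV query mode can be parsed in a few lines. This is a sketch rather than a finished tool: the helper names are made up for this example, the sample text is hypothetical, and the query assumes a driver that reports per-process memory.

```python
import subprocess

# CSV query mode of nvidia-smi: one row per compute process.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Turn nvidia-smi CSV rows into (pid, name, used_mib), largest first."""
    rows = []
    for line in csv_text.strip().splitlines():
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        used = used.strip()
        # Some drivers report "[N/A]" instead of a number; count that as 0.
        used_mib = int(used) if used.isdigit() else 0
        rows.append((int(pid), name.strip(), used_mib))
    return sorted(rows, key=lambda row: row[2], reverse=True)


def list_gpu_processes():
    """Run nvidia-smi and parse its output (needs an NVIDIA driver installed)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)


# Hypothetical sample output, for illustration only.
SAMPLE = "1337, /usr/bin/firefox, 310\n4242, python, 6912\n"
for pid, name, used_mib in parse_compute_apps(SAMPLE):
    print(f"{pid:>6}  {used_mib:>6} MiB  {name}")
```

On a machine with a GPU, call list_gpu_processes() instead of parsing the sample; the process at the top of the list is the first candidate to close.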

3. Lower the batch size and the context length

If you are training or running batched inference, the batch is often the thing that tips you over. Set batch size to 1 first and confirm it loads at all. Long context is the other silent memory hog — the KV cache grows with sequence length, so a model that loads fine at 2K tokens can OOM at 32K. Trim max_seq_len / --ctx-size to what you actually need.
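To see why context length matters so much, it helps to size the KV cache directly. The sketch below uses the standard estimate of one key and one value tensor per layer; the model shape (32 layers, 32 KV heads, head dimension 128, FP16 cache) is an assumed 7B-class layout without grouped-query attention, so substitute the values from your own model's config.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len,
                 batch_size=1, bytes_per_value=2):
    """Estimate KV cache size in GiB for a decoder-only transformer."""
    # The leading 2 accounts for one key and one value tensor per layer.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * seq_len * batch_size * bytes_per_value)
    return total_bytes / 2**30


# Assumed 7B-class shape: 32 layers, 32 KV heads, head_dim 128, FP16 cache.
for seq_len in (2048, 32768):
    size = kv_cache_gib(32, 32, 128, seq_len)
    print(f"{seq_len:>6} tokens: {size:.1f} GiB")
```

Under these assumptions the cache costs 0.5 MiB per token, so it goes from 1 GiB at 2K tokens to 16 GiB at 32K, and it scales linearly with batch size as well. Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache.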

4. Reduce memory fragmentation (PyTorch)

Sometimes you have enough total VRAM but it is fragmented, so a single large allocation fails. Telling the allocator it can grow its segments often recovers that headroom. Set this before launching your script:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
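If you launch from a notebook or an IDE rather than a shell, the same setting can be applied from Python. A minimal sketch, assuming nothing in the process has initialised CUDA yet; the safest place for it is the very top of the script, before torch is imported.

```python
import os

# Set the allocator option before PyTorch initialises CUDA. setdefault
# keeps any value that was already exported in the shell.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

# Import torch only after this point, for example:
# import torch

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

In a notebook this means restarting the kernel first, since a kernel that has already touched the GPU will not pick up the change.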

5. Offload some layers to CPU (llama.cpp / Ollama)

If you are on GGUF tooling and just slightly over budget, you can push a portion of the layers to system RAM. In llama.cpp the flag is --n-gpu-layers; Ollama exposes the same control as the num_gpu parameter. It is slower than running fully on the GPU, but it is the difference between running and not running. Note this is a fallback, not a free lunch: if you are offloading most of the model, you want a smaller one.

# llama.cpp: keep 20 layers on the GPU, the rest on CPU
./llama-cli -m model.gguf --n-gpu-layers 20
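A starting value for --n-gpu-layers can be estimated from the file size. The sketch below is a rough heuristic, not something llama.cpp computes for you: it assumes all layers are about the same size, and the reserve_gb parameter (room for the KV cache and buffers) is a hypothetical knob introduced for this example.

```python
def gpu_layers_that_fit(model_file_gb, n_layers, free_vram_gb, reserve_gb=1.0):
    """Rough guess at how many layers fit in the VRAM that is free."""
    # Assumes equal-sized layers, which is only approximately true.
    per_layer_gb = model_file_gb / n_layers
    # Hold back some VRAM for the KV cache and scratch buffers.
    budget_gb = max(free_vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(budget_gb / per_layer_gb))


# Hypothetical case: a 7.9 GB GGUF file with 40 layers and 6 GB of free VRAM.
layers = gpu_layers_that_fit(7.9, 40, 6.0)
print(f"--n-gpu-layers {layers}")
```

Treat the result as a first guess: if the load still fails, lower the number by a few layers, and if nvidia-smi shows spare VRAM once the model is running, raise it.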

Frequently asked questions

Does "CUDA out of memory" mean my GPU is broken?

No. It is a normal, expected error that simply means the model needed more VRAM than the card has free. The hardware is fine — the model is too big for it at the size or quantization you chose.

How much VRAM do I need to avoid CUDA out of memory?

It depends entirely on the model and quantization. As a rough guide, a 7–8B model at Q4 wants about 5–6 GB, a 13–14B model about 9–10 GB, and a 70B model needs 40 GB+ or aggressive quantization. Check your specific model and card to get an exact figure rather than a rule of thumb.

Will lowering the quantization hurt quality?

Far less than people expect. Q4_K_M and Q5_K_M are the sweet spot for most local use — the quality drop versus FP16 is small and usually not noticeable in chat, coding, or summarisation. Going below Q4 (Q3, Q2) is where degradation becomes obvious.

I set expandable_segments and it still crashes. Now what?

Then you genuinely do not have enough VRAM, and the fragmentation tweak cannot conjure memory that is not there. Step down a model size or a quant level — that is the durable fix.