Written by Jakub Rusinowski · Last updated June 15, 2026
Founder, LLM Configurator — AI educator & workshop leader on local LLM deployment
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 8.00 GiB total capacity; 7.21 GiB already allocated)
The model starts loading, the GPU fans spin up, and then it dies — usually the moment the weights or the first batch hit the card. You see it in PyTorch, in text-generation-webui, in vLLM, and under the hood of a lot of "it just crashed" reports.
There is no mystery here: the allocation needs more GPU memory than your card has left. In the message above, the 2.00 GiB after "Tried to allocate" plus the 7.21 GiB "already allocated" comes to 9.21 GiB, which is larger than the 8.00 GiB "total capacity". That last figure is your real VRAM ceiling, and on most consumer cards it is 8, 12, 16, or 24 GB. Model weights, the KV cache for your context, and PyTorch's own overhead all draw from that same pool.
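To make that arithmetic concrete, here is a minimal sketch that pulls the three figures out of an error message and reports how far over the ceiling the request was. The function name `oom_shortfall_gib` and the regular expression are illustrative assumptions; the pattern targets the older message wording quoted at the top of this article, and newer PyTorch releases phrase the message differently.

```python
import re

# Assumption: this pattern matches the older PyTorch OOM wording quoted in this
# article. Newer releases phrase the message differently and will not match.
_OOM_PATTERN = re.compile(
    r"Tried to allocate (?P<request>[\d.]+) (?P<request_unit>[MG]iB).*?"
    r"(?P<total>[\d.]+) (?P<total_unit>[MG]iB) total capacity.*?"
    r"(?P<allocated>[\d.]+) (?P<allocated_unit>[MG]iB) already allocated"
)

_UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def oom_shortfall_gib(message: str) -> float:
    """Return (requested + already allocated - total capacity) in GiB.

    A positive result means the request could not fit on the card.
    """
    match = _OOM_PATTERN.search(message)
    if match is None:
        raise ValueError("unrecognised OOM message format")

    def gib(name: str) -> float:
        # Convert the captured number to GiB using its captured unit.
        return float(match[name]) * _UNIT_TO_GIB[match[name + "_unit"]]

    return gib("request") + gib("allocated") - gib("total")
```

For the message at the top of this page, the result is about 1.21 GiB: the card would have needed that much more room for the allocation to succeed.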
Most of the time the honest answer is that the model is simply too big for the card. A 7B model in full FP16 stores 2 bytes per parameter, so it wants roughly 14 GB just for weights; the same model at Q4_K_M wants closer to 4–5 GB and is nearly indistinguishable in quality for most work. Dropping a quant level, or stepping down a model size, fixes this permanently instead of papering over it. The fastest way to know exactly what fits is to check your card against the model library rather than guessing.
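The weight figures above follow from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below is a rough estimator only. The helper name `weight_size_gb` is hypothetical, and the 4.8 bits per weight used for Q4_K_M is an assumed average, since K-quants mix block types and the real figure varies by model.

```python
def weight_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes).

    Ignores the KV cache, activations, and framework overhead.
    """
    return n_params_billion * bits_per_weight / 8


# FP16 stores 16 bits per weight.
fp16_gb = weight_size_gb(7, 16)

# Assumption: roughly 4.8 bits per weight on average for Q4_K_M.
q4_gb = weight_size_gb(7, 4.8)

print(f"7B FP16:   ~{fp16_gb:.1f} GB")
print(f"7B Q4_K_M: ~{q4_gb:.1f} GB")
```

Treat the result as a floor, not a budget: the context and runtime overhead come on top of it.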
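Look at the "already allocated" figure, which counts memory held by the current process. If it is high before you have even loaded your model, the process is still holding an earlier load, as often happens in a long-running Jupyter kernel. If it is low and the allocation still fails, something else is squatting on the card: a previous run that did not exit cleanly, another Jupyter kernel, a browser using the GPU, or a second process. Check what is resident and kill the stragglers.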
nvidia-smi
# find the PID holding memory, then:
kill <PID>
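If you would rather script that check, a minimal sketch is to ask nvidia-smi for CSV output and sort the processes by memory use. The `--query-compute-apps` and `--format` options are part of nvidia-smi; the helper names `parse_compute_apps` and `gpu_processes` are hypothetical, and the parser assumes process names contain no commas.

```python
import subprocess

# CSV query: one row per compute process, memory reported in MiB without units.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text: str) -> list:
    """Turn nvidia-smi CSV rows into (pid, name, used_mib) tuples, largest first."""
    rows = []
    for line in csv_text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        # Skip blank lines and rows where a field is not numeric (e.g. [N/A]).
        if len(parts) != 3 or not parts[0].isdigit() or not parts[2].isdigit():
            continue
        rows.append((int(parts[0]), parts[1], int(parts[2])))
    return sorted(rows, key=lambda row: row[2], reverse=True)


def gpu_processes() -> list:
    """Run nvidia-smi and return its compute processes, largest first."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

Calling `gpu_processes()` on a machine with an NVIDIA driver returns the biggest memory holder first, which is usually the PID you want to inspect before killing anything.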
If you are training or running batched inference, the batch is often the thing that tips you over. Set batch size to 1 first and confirm it loads at all. Long context is the other silent memory hog — the KV cache grows with sequence length, so a model that loads fine at 2K tokens can OOM at 32K. Trim max_seq_len / --ctx-size to what you actually need.
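The context cost can be estimated with the standard KV cache formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below assumes an FP16 cache and a Llama-2-7B-style shape (32 layers, 32 KV heads, head dimension 128); those numbers are illustrative, so substitute the values from your own model's config. Models that use grouped-query attention have fewer KV heads and a correspondingly smaller cache.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes per value for an FP16 cache
) -> int:
    """Size of the KV cache in bytes for a given context length and batch size."""
    # Leading factor of 2: one tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


GIB = 1024 ** 3

# Assumed shape, for illustration only: 32 layers, 32 KV heads, head_dim 128.
short_ctx = kv_cache_bytes(32, 32, 128, seq_len=2048) / GIB
long_ctx = kv_cache_bytes(32, 32, 128, seq_len=32768) / GIB

print(f"2K context:  {short_ctx:.1f} GiB")
print(f"32K context: {long_ctx:.1f} GiB")
```

Under these assumptions the cache grows from 1 GiB at 2K tokens to 16 GiB at 32K, and it scales the same way with batch size, which is why both knobs are worth turning down first.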
Sometimes you have *enough* total VRAM but it is fragmented, so a single large allocation fails. Telling the allocator it can grow its segments often recovers that headroom. Set this before launching your script:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
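Where a shell export is awkward, for example in a notebook or on Windows, the same setting can be applied from Python. This is a minimal sketch that assumes the variable must be in place before PyTorch makes its first CUDA allocation; setting it before importing torch is the cautious choice.

```python
import os

# Set the allocator option before torch is imported so it is in place
# before the first CUDA allocation. setdefault leaves any value that is
# already present in the environment untouched.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

# Import torch only after this point, e.g.:
# import torch
```

In a notebook, restart the kernel first: if torch has already touched the GPU in that process, changing the variable afterwards is unlikely to have any effect.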
If you are on GGUF tooling and just slightly over budget, you can push a portion of the layers to system RAM with --n-gpu-layers. It is slower than running fully on the GPU, but it is the difference between running and not running. Note this is a fallback, not a free lunch — if you are offloading most of the model, you want a smaller one.
# llama.cpp: keep 20 layers on the GPU, the rest on CPU
./llama-cli -m model.gguf --n-gpu-layers 20
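To pick a starting value for --n-gpu-layers rather than guessing, you can divide the VRAM you are willing to spend on weights by the approximate size of one layer. This is a rough sketch under loud assumptions: every layer is treated as the same size, and `reserve_gb` is a hypothetical allowance for the KV cache and runtime overhead. The helper name `layers_that_fit` is illustrative and not part of llama.cpp.

```python
def layers_that_fit(
    model_size_gb: float,
    n_layers: int,
    free_vram_gb: float,
    reserve_gb: float = 1.5,  # assumed headroom for KV cache and overhead
) -> int:
    """Estimate how many layers fit on the GPU, assuming equal-sized layers."""
    budget_gb = max(free_vram_gb - reserve_gb, 0.0)
    # budget / (model_size / n_layers), rearranged to avoid a tiny divisor.
    fitting = int(budget_gb * n_layers / model_size_gb)
    return min(n_layers, fitting)


# Hypothetical example: a 10 GB GGUF file with 40 layers on a card with
# 8 GB free, keeping 3 GB back for the context and overhead.
print(layers_that_fit(10.0, 40, 8.0, reserve_gb=3.0))
```

Use the result as a first guess, then raise or lower it while watching nvidia-smi. If the estimate comes out at well under half the layers, that is the signal to pick a smaller model or a lower quant.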
The error does not mean your GPU is faulty. It is a normal, expected error that simply means the model needed more VRAM than the card has free. The hardware is fine; the model is too big for it at the size or quantization you chose.
How much VRAM you need depends entirely on the model and quantization. As a rough guide, a 7–8B model at Q4 wants about 5–6 GB, a 13–14B model about 9–10 GB, and a 70B model needs 40 GB+ or aggressive quantization. Check your specific model and card to get an exact figure rather than a rule of thumb.
Quantization costs far less quality than people expect. Q4_K_M and Q5_K_M are the sweet spot for most local use: the quality drop versus FP16 is small and usually not noticeable in chat, coding, or summarisation. Going below Q4 (Q3, Q2) is where degradation becomes obvious.
If the error persists even with expandable_segments set, you genuinely do not have enough VRAM, and the fragmentation tweak cannot conjure memory that is not there. Step down a model size or a quant level. That is the durable fix.