Author: Jakub Rusinowski · Last updated: June 15, 2026
Founder, LLM Configurator — AI educator & workshop leader on local LLM deployment
llm_load_tensors: offloaded 0/33 layers to GPU
The model loads fine and answers correctly — but it crawls, a few tokens per second or worse, with your CPU fans roaring and the GPU sitting idle. The giveaway is in the load log: a line saying only some (or none) of the layers went to the GPU.
Tokens are generated fast only when the whole model lives in VRAM. When it does not fit, llama.cpp / Ollama keeps the overflow layers in system RAM and runs them on the CPU. That spillover is the slowness: CPU inference and the PCIe round-trips are an order of magnitude slower than staying on the card. So this is really an out-of-memory problem wearing a disguise — the model didn't crash, it just demoted itself to the CPU.
If even a few layers stay on the CPU, you are leaving most of your speed on the table. The fix that actually makes it fast is picking a size/quant that fits entirely on the GPU: then 100% of the layers load and tokens/sec jumps back to where it should be. Check exactly which quant of your model fits your card so you get full-GPU speed instead of a CPU crawl.
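To make "fits entirely on the GPU" concrete, here is a rough back-of-the-envelope sketch. It assumes every layer is the same size and that a fixed reserve covers the KV cache and compute buffers; both are simplifications, and the function names, the 1.5 GiB reserve, and the example sizes are hypothetical rather than taken from llama.cpp or Ollama.

```python
import math


def layers_that_fit(model_gib: float, n_layers: int, vram_gib: float,
                    reserve_gib: float = 1.5) -> int:
    """Rough estimate of how many layers fit in VRAM.

    Assumptions (not measured): all layers are equally sized, and
    reserve_gib is enough for the KV cache, compute buffers, and
    whatever else is already using the card.
    """
    if model_gib <= 0 or n_layers <= 0:
        raise ValueError("model_gib and n_layers must be positive")
    per_layer_gib = model_gib / n_layers
    usable_gib = max(vram_gib - reserve_gib, 0.0)
    return min(n_layers, math.floor(usable_gib / per_layer_gib))


def fits_fully(model_gib: float, n_layers: int, vram_gib: float,
               reserve_gib: float = 1.5) -> bool:
    """True when the estimate says every layer lands on the GPU."""
    return layers_that_fit(model_gib, n_layers, vram_gib, reserve_gib) == n_layers
```

With these assumptions, a hypothetical 13 GiB file with 33 layers on an 8 GiB card gets only about half its layers onto the GPU, while a 4.4 GiB quant of the same model fits whole. Treat the result as a first guess and confirm it against the load log.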
First make sure the runtime even sees your GPU. Watch utilisation while generating — if the GPU stays at 0%, the model is fully on CPU (a build/driver problem) rather than just spilling over.
# NVIDIA: watch GPU use live while the model generates
nvidia-smi -l 1
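If you would rather log the numbers than watch them scroll, the same check can be scripted. This is a minimal sketch that assumes an NVIDIA card with nvidia-smi on the PATH; the helper names are hypothetical, and some devices report "[N/A]" for a field, which this simple parser does not handle.

```python
import subprocess

# nvidia-smi's CSV query mode: one line per GPU, e.g. "35, 5120, 8192"
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line: str) -> tuple[int, int, int]:
    """Turn 'util, used_mib, total_mib' into a tuple of ints."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return util, used, total


def read_gpus() -> list[tuple[int, int, int]]:
    """Run nvidia-smi once and return (util %, used MiB, total MiB) per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

Call read_gpus() in a loop while the model is generating. Utilisation stuck at 0 with almost no memory used points at a CPU-only run; memory near the total with utilisation that keeps dipping is more consistent with spillover.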
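If you are close to fitting, push as many layers as possible to the card with --n-gpu-layers. Setting it to a number at least as high as the model's layer count maximises what lives in VRAM; 99 is a common choice. Newer builds also accept -1 for all layers, but older ones may not treat it that way, so the large number is the safer default. If you then OOM, you were over budget: step down a quant.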
./llama-cli -m model.gguf --n-gpu-layers -1
A CPU-only build of llama.cpp will *never* touch the GPU no matter what flags you pass. If you compiled from source, rebuild with the right backend (CUDA for NVIDIA, Metal for Mac, ROCm/HIP for AMD). Ollama ships GPU support by default, but a stale install can lose it after an OS/driver update.
Check the load log for an "offloaded N/M layers to GPU" line: if N is less than M, the remaining layers are running on the CPU. Or watch nvidia-smi (NVIDIA) / Activity Monitor GPU history (Mac) while generating; sustained GPU utilisation during generation means the card is doing the work.
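The log check is easy to automate. The sketch below matches only the "offloaded N/M layers to GPU" fragment quoted above, because the prefix in front of it can differ between versions; the function names and the four labels are hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "llm_load_tensors: offloaded 0/33 layers to GPU"
OFFLOAD_RE = re.compile(r"offloaded\s+(\d+)/(\d+)\s+layers\s+to\s+GPU")


def offload_status(log_text: str) -> Optional[Tuple[int, int]]:
    """Return (layers_on_gpu, total_layers), or None if no such line exists."""
    match = OFFLOAD_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def describe(log_text: str) -> str:
    """Classify a load log as 'unknown', 'cpu-only', 'partial', or 'full'."""
    status = offload_status(log_text)
    if status is None:
        return "unknown"
    on_gpu, total = status
    if on_gpu == 0:
        return "cpu-only"
    return "full" if on_gpu >= total else "partial"
```

Feeding it the sample line from the top of this page yields (0, 33) and the label "cpu-only".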
Every token has to pass through the CPU-resident layers and cross the PCIe bus, and the CPU layers run far slower than the GPU. One slow stage bottlenecks the whole pipeline, so even a little spillover tanks your tokens/sec.
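A toy model shows why a small spillover costs so much. It assumes each token visits every layer in sequence, so per-token time is the sum of the per-layer times. The per-layer figures in the usage note are made up for illustration (a CPU layer taking 10x as long as a GPU layer, in line with the "order of magnitude" gap), not measurements of any real hardware.

```python
def tokens_per_second(n_layers: int, gpu_layers: int,
                      gpu_ms_per_layer: float, cpu_ms_per_layer: float,
                      transfer_ms: float = 0.0) -> float:
    """Estimate tokens/sec for a model split between GPU and CPU.

    Per-token time = GPU layers * GPU time + CPU layers * CPU time,
    plus an optional fixed transfer cost whenever the model is split.
    All timing inputs are illustrative assumptions.
    """
    if not 0 <= gpu_layers <= n_layers:
        raise ValueError("gpu_layers must be between 0 and n_layers")
    cpu_layers = n_layers - gpu_layers
    ms_per_token = gpu_layers * gpu_ms_per_layer + cpu_layers * cpu_ms_per_layer
    if 0 < gpu_layers < n_layers:
        ms_per_token += transfer_ms  # only paid when the model is split
    return 1000.0 / ms_per_token
```

With the assumed 0.5 ms per GPU layer and 5 ms per CPU layer, a 33-layer model fully on the GPU comes out at roughly 60 tokens/sec. Moving just 4 of those layers to the CPU drops the estimate to roughly 29, so about an eighth of the layers costs about half the speed.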
For interactive use, a smaller model that runs fully on the GPU usually beats a larger one crawling on the CPU — the quality gap is small and the speed gap is enormous. Try the well-fitting smaller model before settling for the crawl.