Model loads but runs painfully slow (it is on your CPU, not your GPU)

Name: LLM Configurator — GPU VRAM Checker
Author: LLM Configurator

Author: Jakub Rusinowski · Last updated: June 15, 2026

Founder, LLM Configurator — AI educator & workshop leader on local LLM deployment

The error

llm_load_tensors: offloaded 0/33 layers to GPU

When you see it

The model loads fine and answers correctly — but it crawls, a few tokens per second or worse, with your CPU fans roaring and the GPU sitting idle. The giveaway is in the load log: a line saying only some (or none) of the layers went to the GPU.
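If you would rather not eyeball the log, the check can be scripted. This is a minimal sketch, assuming the log wording matches the "offloaded N/M layers to GPU" line shown above; `offload_status` is a hypothetical helper name, not part of llama.cpp.

```shell
# Hypothetical helper: classify one load-log line.
# Assumes the "offloaded N/M layers to GPU" wording from the error above.
offload_status() {
  # Extract "N M" from the line; empty when the pattern is absent.
  counts=$(printf '%s\n' "$1" |
    sed -n 's|.*offloaded \([0-9][0-9]*\)/\([0-9][0-9]*\) layers to GPU.*|\1 \2|p')
  if [ -z "$counts" ]; then
    echo "no offload line found"
    return 1
  fi
  on_gpu=${counts% *}   # N: layers placed on the GPU
  total=${counts#* }    # M: layers in the model
  if [ "$on_gpu" -eq 0 ]; then
    echo "cpu-only: 0/$total layers on GPU"
  elif [ "$on_gpu" -lt "$total" ]; then
    echo "partial: $on_gpu/$total layers on GPU"
  else
    echo "full: $on_gpu/$total layers on GPU"
  fi
}

offload_status 'llm_load_tensors: offloaded 0/33 layers to GPU'
# prints: cpu-only: 0/33 layers on GPU
```

Anything other than "full" means some of the model is running on the CPU.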

What's actually going on

Token generation is fast only when the whole model lives in VRAM. When it does not fit, llama.cpp / Ollama keeps the overflow layers in system RAM and runs them on the CPU. That spillover is the slowness: the CPU-resident layers run far slower than the GPU ones, and every token also has to cross the PCIe bus, so throughput can drop by an order of magnitude compared with staying on the card. So this is really an out-of-memory problem wearing a disguise: the model didn't crash, it just demoted itself to the CPU.
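To see why a small spillover hurts so much, here is a back-of-the-envelope sketch. It is a toy model, not a benchmark: it assumes every layer costs the same on the GPU and that a CPU-resident layer costs ten times as much (the "order of magnitude" above). `relative_speed` is a hypothetical helper.

```shell
# relative_speed CPU_LAYERS TOTAL_LAYERS [CPU_SLOWDOWN]
# Prints estimated tokens/sec as a percentage of full-GPU speed.
# Assumption: a CPU layer costs CPU_SLOWDOWN times a GPU layer (default 10).
relative_speed() {
  awk -v c="$1" -v n="$2" -v s="${3:-10}" 'BEGIN {
    cost = (n - c) + c * s          # time per token, in GPU-layer units
    printf "%.0f%%\n", 100 * n / cost
  }'
}

relative_speed 0 33     # all layers on the GPU         -> 100%
relative_speed 4 33     # 4 of 33 layers on the CPU     -> 48%
relative_speed 33 33    # nothing offloaded to the GPU  -> 10%
```

Under these assumptions, leaving just 4 of 33 layers on the CPU already costs about half the speed. Real numbers depend on your hardware, but the shape of the curve is the point.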

How to fix it

1. Use a model or quant that fully fits your VRAM (most common fix)

If even a few layers stay on the CPU, you are leaving a lot of speed on the table. The fix that actually makes it fast is picking a size/quant that fits entirely on the GPU: then 100% of the layers load onto the card and tokens/sec jumps back to where it should be. Check exactly which quant of your model fits your card so you get full-GPU speed instead of a CPU crawl.
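The selection logic can be sketched in a few lines. This assumes the GGUF file size is a fair proxy for the VRAM the weights need, plus a headroom figure you choose for context and buffers. `pick_quant`, the 1.5 GiB headroom, and the sizes in the example are illustrative placeholders, not measurements.

```shell
# pick_quant VRAM_GIB HEADROOM_GIB   (reads "NAME SIZE_GIB" lines on stdin)
# Prints the largest quant whose size plus headroom fits in VRAM, or "none".
# Assumption: file size approximates the weights' VRAM footprint.
pick_quant() {
  awk -v v="$1" -v h="$2" '
    BEGIN { best = 0 }
    $2 + h <= v && $2 > best { best = $2; name = $1 }
    END { if (name != "") print name; else print "none" }'
}

# Placeholder sizes in GiB; substitute the real file sizes of your model.
pick_quant 8 1.5 <<'EOF'
Q8_0   8.5
Q5_K_M 5.7
Q4_K_M 4.9
EOF
# prints: Q5_K_M
```

Picking the largest quant that fits keeps as much quality as the card allows while still loading every layer on the GPU.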

2. Confirm the GPU is actually being used

First make sure the runtime even sees your GPU. Watch utilisation while generating — if the GPU stays at 0%, the model is fully on CPU (a build/driver problem) rather than just spilling over.

# NVIDIA: watch GPU use live while the model generates
nvidia-smi -l 1
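For a more compact readout, nvidia-smi can print just utilisation and memory as CSV, which is easy to turn into a verdict. A sketch, assuming an NVIDIA card; `gpu_verdict` is a hypothetical helper and the sample line is made up for illustration.

```shell
# Hypothetical helper: interpret "util, mem_used, mem_total" CSV lines.
gpu_verdict() {
  awk -F', *' '{
    if ($1 + 0 == 0)
      print "idle: GPU at 0%, generation is probably happening on the CPU"
    else
      printf "busy: GPU at %d%%, %d of %d MiB VRAM in use\n", $1, $2, $3
  }'
}

# Live use, while the model is generating (needs an NVIDIA GPU):
#   nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits -l 1 | gpu_verdict

# Made-up sample line, so the helper can be tried without a GPU:
echo '0, 310, 8192' | gpu_verdict
# prints: idle: GPU at 0%, generation is probably happening on the CPU
```

A busy GPU whose VRAM is nearly full points to spillover rather than a broken build.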

3. Force more layers onto the GPU (llama.cpp)

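If you are close to fitting, push as many layers as possible to the card with --n-gpu-layers (short form: -ngl). Any value at or above the model's layer count maximises what lives in VRAM. -1 is often used to mean "all", but depending on your llama.cpp version it may instead select the build's default, so a large number such as 99 is the safer way to request every layer. If you then OOM, you were over budget: step down a quant.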

./llama-cli -m model.gguf --n-gpu-layers -1
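If asking for every layer runs out of memory, you can estimate a starting value instead of guessing. A rough sketch, assuming all layers are about the same size and that you reserve some VRAM for context and buffers; `layers_that_fit` and the 1.5 GiB default headroom are hypothetical.

```shell
# layers_that_fit MODEL_GIB TOTAL_LAYERS VRAM_GIB [HEADROOM_GIB]
# Prints how many layers should fit on the card, clamped to 0..TOTAL_LAYERS.
# Assumption: layers are equally sized; headroom defaults to 1.5 GiB.
layers_that_fit() {
  awk -v m="$1" -v n="$2" -v v="$3" -v h="${4:-1.5}" 'BEGIN {
    per_layer = m / n
    fit = int((v - h) / per_layer)
    if (fit < 0) fit = 0
    if (fit > n) fit = n
    print fit
  }'
}

layers_that_fit 6.6 33 6    # 6.6 GiB model, 33 layers, 6 GiB card
# prints: 22

# Then pass the estimate to llama.cpp and adjust from there:
#   ./llama-cli -m model.gguf --n-gpu-layers "$(layers_that_fit 6.6 33 6)"
```

Treat the result as a first guess: if the load still fails, lower it by a few layers. If the answer is well below the total, a smaller quant will serve you better than partial offload.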

4. Make sure you installed a GPU-enabled build

A CPU-only build of llama.cpp will *never* touch the GPU no matter what flags you pass. If you compiled from source, rebuild with the right backend (CUDA for NVIDIA, Metal for Mac, ROCm/HIP for AMD). Ollama ships GPU support by default, but a stale install can lose it after an OS/driver update.
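For a source build, the backend is chosen with a CMake option. The sketch below maps each backend to the option name used in recent llama.cpp build instructions, as far as I know. Older checkouts used different names, so check the build docs in your copy of the repository. `backend_flag` is a hypothetical helper.

```shell
# Hypothetical helper: CMake option for a given GPU backend.
# Option names follow recent llama.cpp build docs; older trees may differ.
backend_flag() {
  case "$1" in
    cuda)  echo "-DGGML_CUDA=ON" ;;    # NVIDIA
    metal) echo "-DGGML_METAL=ON" ;;   # Apple Silicon / Mac
    hip)   echo "-DGGML_HIP=ON" ;;     # AMD via ROCm/HIP
    *)     echo "unknown backend: $1" >&2; return 1 ;;
  esac
}

backend_flag cuda
# prints: -DGGML_CUDA=ON

# Rebuild from the llama.cpp checkout (not run here):
#   cmake -B build "$(backend_flag cuda)"
#   cmake --build build --config Release
```

After rebuilding, reload the model and confirm that the "offloaded N/M layers to GPU" line now shows a non-zero N.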

Frequently asked questions

How do I know if my model is running on CPU or GPU?

Check the load log for an "offloaded N/M layers to GPU" line — if N is less than M, part of it is on the CPU. Or watch nvidia-smi (NVIDIA) / Activity Monitor GPU history (Mac) while generating; a busy GPU means it is being used.

Why is partial GPU offload so much slower than full?

Every token has to pass through the CPU-resident layers and cross the PCIe bus, and the CPU layers run far slower than the GPU. One slow stage bottlenecks the whole pipeline, so even a little spillover tanks your tokens/sec.

Is a slow-but-working model better than a smaller one?

For interactive use, a smaller model that runs fully on the GPU usually beats a larger one crawling on the CPU: the quality gap is often modest, while the speed gap is enormous. Try the well-fitting smaller model before settling for the crawl.