作者: Jakub Rusinowski · 最后更新: 2026年6月15日
Founder, LLM Configurator — AI educator & workshop leader on local LLM deployment
ggml_metal_init: failed to allocate buffer
On an M1/M2/M3/M4 Mac the model either runs slowly with no GPU activity in Activity Monitor, or it dies with a Metal buffer allocation error. You expected the Apple GPU to do the work and it isn't — or it tried and ran out of room.
Apple Silicon shares one pool of unified memory between the CPU and GPU, and llama.cpp/Ollama use Apple's Metal backend to run on the GPU. Two things go wrong: either Metal isn't engaged (an x86 build running under Rosetta, or zero GPU layers), so everything falls to the CPU; or Metal *is* engaged but the model plus context exceeds the slice of unified memory macOS will hand the GPU, so the buffer allocation fails. Either way it ties back to size — big models crowd a shared memory pool.
On a Mac your "VRAM" is a portion of total unified memory, and macOS won't give all of it to the GPU. An 8 GB Mac comfortably runs 3B–7B models at Q4; 16 GB opens up larger ones; 32 GB+ handles the big ones. Matching the model to your memory is what keeps Metal allocations from failing. Check what your specific Mac can hold before downloading.
An Intel build of llama.cpp running under Rosetta will not use Metal — it runs on the CPU only. Confirm your binary is native arm64. If you built from source, build with Metal enabled.
# should print "arm64", not "x86_64"
file $(which ollama)
uname -m
With the Metal build, offload all layers to the GPU. On llama.cpp that's -ngl set high (or -1). Ollama enables Metal automatically, so if it's still on CPU there, suspect a Rosetta/x86 install.
./llama-cli -m model.gguf -ngl 999
If Metal is engaged but you hit the allocation error, the KV cache for a long context is likely tipping you over the unified-memory slice. Reduce the context length and retry.
No. Apple Silicon uses unified memory shared by the CPU and GPU. Your effective "VRAM" is a portion of total RAM, and macOS reserves some for the system — so a 16 GB Mac does not give all 16 GB to the model.
Open Activity Monitor → Window → GPU History (or run sudo powermetrics --samplers gpu_power) while generating. Real GPU activity means Metal is engaged; a flat line means it is running on the CPU.
Almost always unified-memory size. A 32 GB or 64 GB Mac can hold models that simply do not fit on an 8 GB or 16 GB machine. Pick a model matched to your memory rather than theirs.