Apple Silicon: model won't use the GPU / Metal not engaged


Author: Jakub Rusinowski · Last updated: June 15, 2026

Founder, LLM Configurator — AI educator & workshop leader on local LLM deployment

The error

ggml_metal_init: failed to allocate buffer

When you see it

On an M1/M2/M3/M4 Mac the model either runs slowly with no GPU activity in Activity Monitor, or it dies with a Metal buffer allocation error. You expected the Apple GPU to do the work and it isn't — or it tried and ran out of room.

What's actually going on

Apple Silicon shares one pool of unified memory between the CPU and GPU, and llama.cpp/Ollama use Apple's Metal backend to run on the GPU. Two things go wrong: either Metal isn't engaged (an x86 build running under Rosetta, or zero GPU layers), so everything falls to the CPU; or Metal *is* engaged but the model plus context exceeds the slice of unified memory macOS will hand the GPU, so the buffer allocation fails. The first is a build or configuration problem; the second comes down to size, because a big model and its context crowd the shared memory pool.

How to fix it

1. Pick a model sized for your unified memory (most common fix)

On a Mac your "VRAM" is a portion of total unified memory, and macOS won't give all of it to the GPU. An 8 GB Mac comfortably runs 3B–7B models at Q4; 16 GB opens up larger ones; 32 GB+ handles the big ones. Matching the model to your memory is what keeps Metal allocations from failing. Check what your specific Mac can hold before downloading.
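As a rough sanity check before downloading, the weight size can be estimated from parameter count and quantization. The sketch below is an approximation, not a measurement: the function names are hypothetical, and the 4.5 bits per weight used for a Q4-class file is an assumed ballpark. It also ignores the KV cache and runtime overhead, so leave headroom.

```shell
#!/bin/sh
# Rough weight-size estimate in GB: params (billions) * bits per weight / 8.
# Hypothetical helper; bits-per-weight values are assumptions, not exact figures.
estimate_weights_gb() {
    # $1 = parameters in billions, $2 = average bits per weight
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Total unified memory in GB on macOS (hw.memsize is reported in bytes).
# Defined here but not called, since it only works on a Mac.
total_memory_gb() {
    sysctl -n hw.memsize | awk '{ printf "%.0f\n", $1 / 1073741824 }'
}

estimate_weights_gb 7 4.5    # 7B model at an assumed ~4.5 bits/weight
```

Compare the result with the output of total_memory_gb on your Mac, keeping in mind that macOS only hands part of that total to the GPU.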


2. Make sure you are running a native arm64 build

An Intel build of llama.cpp running under Rosetta will not use Metal — it runs on the CPU only. Confirm your binary is native arm64. If you built from source, build with Metal enabled.

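# `file` should report "arm64" for the binary, not "x86_64"
file "$(which ollama)"
# should print "arm64"; "x86_64" means this shell itself is running under Rosetta
uname -m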
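If you check several binaries, a small helper can classify what `file` reports. This is only a sketch: classify_arch is a hypothetical name, and it simply pattern-matches the architecture strings in the output. The commented macOS lines also show sysctl.proc_translated, which reports whether the current process is being translated by Rosetta.

```shell
#!/bin/sh
# Classify the text printed by `file <binary>` by CPU architecture.
# Hypothetical helper; it only looks for architecture names in the string.
classify_arch() {
    case "$1" in
        *arm64*x86_64*|*x86_64*arm64*) echo "universal" ;;  # fat binary with both slices
        *arm64*)  echo "native" ;;                          # Apple Silicon build
        *x86_64*) echo "intel" ;;                           # runs under Rosetta, no Metal
        *)        echo "unknown" ;;
    esac
}

# On macOS you might call it like this:
#   classify_arch "$(file "$(command -v ollama)")"
#   sysctl -n sysctl.proc_translated   # 1 = running under Rosetta, 0 = native
```

A result of "intel" points to the Rosetta case described above; reinstalling a native arm64 build is the fix.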

3. Push layers onto the GPU (llama.cpp)

With the Metal build, offload all layers to the GPU. On llama.cpp that's -ngl set high (or -1). Ollama enables Metal automatically, so if it's still on CPU there, suspect a Rosetta/x86 install.

./llama-cli -m model.gguf -ngl 999
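To confirm the offload took effect, you can look at the load log. The sketch below assumes llama.cpp prints a line of the form "offloaded N/M layers to GPU" while loading; the exact wording may differ between versions, and offload_status is a hypothetical helper name.

```shell
#!/bin/sh
# Summarize GPU offload from llama.cpp load output read on stdin.
# Assumes a log line like "... offloaded 33/33 layers to GPU" (may vary by version).
offload_status() {
    sed -n 's/.*offloaded \([0-9][0-9]*\)\/\([0-9][0-9]*\) layers to GPU.*/\1 \2/p' |
    head -n 1 |
    awk '{
        if ($1 == 0)      print "cpu-only"   # nothing on the GPU
        else if ($1 < $2) print "partial"    # some layers left on the CPU
        else              print "full"       # every layer offloaded
    }'
}

# Example with a captured log file (hypothetical path):
#   offload_status < load.log
```

No output means the expected line was not found, in which case fall back to checking GPU activity directly.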

4. Lower context if the Metal buffer fails

If Metal is engaged but you hit the allocation error, the KV cache for a long context is likely tipping you over the unified-memory slice. Reduce the context length and retry.
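To see why context length matters, here is a back-of-the-envelope KV cache estimate. It is a sketch under stated assumptions: an fp16 cache (2 bytes per value) and an illustrative 7B-class shape of 32 layers, 32 KV heads, and head dimension 128. Your model's actual shape may differ, especially if it uses grouped-query attention, and kv_cache_mib is a hypothetical helper name.

```shell
#!/bin/sh
# Approximate fp16 KV cache size in MiB:
#   2 (K and V) * layers * context * kv_heads * head_dim * 2 bytes
kv_cache_mib() {
    # $1 = layers, $2 = context length, $3 = KV heads, $4 = head dimension
    awk -v l="$1" -v c="$2" -v h="$3" -v d="$4" \
        'BEGIN { printf "%d\n", 2 * l * c * h * d * 2 / 1048576 }'
}

kv_cache_mib 32 8192 32 128   # assumed 7B-class shape at 8K context
kv_cache_mib 32 2048 32 128   # same shape at 2K context

# Then retry with a smaller context, for example:
#   ./llama-cli -m model.gguf -ngl 999 -c 2048
```

Under these assumptions the cache grows linearly with context, so cutting the context from 8K to 2K frees roughly three quarters of it. In Ollama the equivalent knob is the num_ctx parameter.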


Frequently asked questions

Does my Mac's GPU have separate VRAM?

No. Apple Silicon uses unified memory shared by the CPU and GPU. Your effective "VRAM" is a portion of total RAM, and macOS reserves some for the system — so a 16 GB Mac does not give all 16 GB to the model.

How do I confirm the model is using the GPU on a Mac?

Open Activity Monitor → Window → GPU History (or run sudo powermetrics --samplers gpu_power) while generating. Real GPU activity means Metal is engaged; a flat line means it is running on the CPU.

Why does the same model run on my friend’s Mac but not mine?

Almost always unified-memory size. A 32 GB or 64 GB Mac can hold models that simply do not fit on an 8 GB or 16 GB machine. Pick a model matched to your memory rather than theirs.