Written by Jakub Rusinowski · Last updated June 15, 2026
Founder, LLM Configurator — AI educator & workshop leader on local LLM deployment
Which GGUF quant should I download?
You found the model on Hugging Face and the repo has a dozen .gguf files: Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0, and more. They range from tiny to huge and the names mean nothing yet. You just want to know which one to click.
Quantization stores the model's weights at lower precision to shrink the file, and the number in the name is roughly the bits per weight. Lower numbers (Q2, Q3) are smaller and faster but lose accuracy; higher numbers (Q6, Q8) are closer to the original but bigger and need more VRAM. There's no single "best" file: the right quant is the largest one that still fits your hardware with headroom for context. So the question "which quant?" is really "how much VRAM do I have?"
The decision rule is simple: choose the highest quant your card can hold with a little room to spare for context. For most people that lands on Q4_K_M. It's the community default sweet spot: small enough to fit common cards and close enough to full precision that the difference is hard to notice in everyday chat, coding, or summarisation. Go higher (Q5_K_M, Q6_K) only if you have spare VRAM; go lower (Q3, Q2) only if Q4 won't fit. The fastest way to land on the right file is to check which quant of your model fits your exact card.
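The rule above can be sketched in a few lines of Python. This is a minimal sketch, not a definitive tool: the function name `pick_quant`, the `headroom_gb` allowance, and the file sizes in `example_files_gb` are all assumptions introduced for illustration. Read the real sizes off the repo's file list.

```python
from typing import Optional

# Quants ordered from highest to lowest quality, as listed in this article.
QUALITY_ORDER = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M", "Q2_K"]


def pick_quant(files_gb: dict[str, float], vram_gb: float,
               headroom_gb: float = 2.0) -> Optional[str]:
    """Return the highest-quality quant whose file fits in VRAM with headroom.

    files_gb maps quant name -> file size in GB.
    headroom_gb is an assumed allowance for context; tune it for your setup.
    """
    budget = vram_gb - headroom_gb
    for quant in QUALITY_ORDER:
        size = files_gb.get(quant)
        if size is not None and size <= budget:
            return quant
    return None  # nothing fits; consider a smaller model


# Placeholder sizes for an imaginary model; not taken from any real repo.
example_files_gb = {
    "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
    "Q4_K_M": 4.9, "Q3_K_M": 4.0, "Q2_K": 3.2,
}

print(pick_quant(example_files_gb, vram_gb=8.0))
```

With these placeholder numbers, an 8 GB card leaves a 6 GB budget, so the loop skips Q8_0 and Q6_K and settles on Q5_K_M. A fixed headroom is the crudest possible choice; long contexts need more.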
Q8_0: near-lossless and large; only if you have lots of VRAM.
Q6_K: excellent quality, a good step down in size from Q8_0.
Q5_K_M: high quality, modest size.
Q4_K_M: the recommended default; best balance for most setups.
Q3_K_M: noticeably softer, for tight memory.
Q2_K: smallest, clear quality loss; last resort.
The names containing _K (Q6_K, Q4_K_M, and so on) are the newer k-quants; a trailing _S, _M, or _L marks a small, medium, or large variant. Prefer them over the older plain Q4_0 style at a similar size.
Casual chat tolerates lower quants well. Coding, structured output, and tasks needing precision benefit from Q5/Q6 if you can fit them. If two quants both fit, the bigger one is the safer pick for demanding work.
If you want a one-line answer, download Q4_K_M. It is the community default because it balances size, speed, and quality: it fits common GPUs and is very close to the full model for everyday use. Move up only if you have VRAM to spare.
Q4_0 and Q4_K_M are both roughly 4-bit, but the k-quants spread precision unevenly across the model, keeping more bits where quantization error hurts most. That gives better quality at a similar size, so prefer the _K_M variants when available.
A bigger model at a lower quant is often the better choice: a larger model at Q4 usually beats a much smaller model at Q8 of similar total size. But both must fit your VRAM with room for context, which is the real constraint to check first.
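To make "similar total size" concrete: weight size is roughly the parameter count times the bits per weight, divided by 8. The sketch below uses approximate bits-per-weight figures (about 4.8 for Q4_K_M and 8.5 for Q8_0). Treat these as assumptions, since real files vary by architecture, and note that the 14B and 7B parameter counts are purely illustrative.

```python
# Approximate effective bits per weight; assumed values for illustration only.
APPROX_BPW = {"Q4_K_M": 4.8, "Q8_0": 8.5}


def approx_weights_gb(params_billion: float, quant: str) -> float:
    """Estimate weight size in GB (10^9 bytes): params * bits per weight / 8."""
    return params_billion * APPROX_BPW[quant] / 8


big_q4 = approx_weights_gb(14, "Q4_K_M")   # hypothetical 14B model
small_q8 = approx_weights_gb(7, "Q8_0")    # hypothetical 7B model
print(f"14B at Q4_K_M: ~{big_q4:.1f} GB")
print(f"7B at Q8_0:    ~{small_q8:.1f} GB")
```

Under these assumptions the two files land in the same ballpark (roughly 8.4 GB versus 7.4 GB), which is the situation where the larger model at Q4 is usually the better pick.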
To check whether a quant fits, compare the file size plus the model's runtime memory needs against your VRAM, leaving headroom for the KV cache, which grows with context length. Rather than guessing from file sizes, check your card against the model to see which quants fit.
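A rough way to size that headroom: the KV cache stores one key and one value vector per layer, per KV head, per token. The sketch below applies that formula under stated assumptions: an fp16 cache (2 bytes per element), a hypothetical model shape (32 layers, 8 KV heads, head dimension 128), a placeholder 4.9 GB file, and a made-up `overhead_gb` allowance for runtime buffers. Read the real values from the model's config.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_len * bytes_per_elem)
    return total_bytes / 1e9


def fits(file_gb: float, cache_gb: float, vram_gb: float,
         overhead_gb: float = 1.0) -> bool:
    """True if weights + KV cache + an assumed runtime overhead fit in VRAM."""
    return file_gb + cache_gb + overhead_gb <= vram_gb


# Hypothetical model shape with an 8192-token context.
cache = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
print(f"KV cache: ~{cache:.2f} GB")
print(fits(file_gb=4.9, cache_gb=cache, vram_gb=8.0))
```

For this made-up shape the cache comes to about 1.07 GB at 8192 tokens, and it scales linearly with context length, so doubling the context doubles the headroom you need.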