RunLocal

Guide · 10 min · May 15, 2026

Choosing a GGUF quantization without lying to yourself

Every model on Hugging Face that comes packaged for llama.cpp arrives in a parade of quantization variants: Q2_K, Q3_K_S, Q4_K_M, Q5_K_M, Q6_K, Q8_0, plus the new IQ-prefixed family. Most users pick one almost at random, run with it for months, and never compare. This post is the explanation I wish I had read before doing exactly that.

What quantization actually does

A large language model is, at runtime, a long list of floating-point numbers. The model is trained in 16-bit precision (sometimes 32-bit for parts of it). Quantization rewrites those numbers in fewer bits. A 4-bit quantization stores each weight in four bits instead of sixteen, which cuts the model size by roughly a factor of four and the memory bandwidth required to feed those weights to the GPU by the same factor. Less data moved means faster inference, smaller files, more parameters per gigabyte of memory.

The cost is precision. A four-bit number cannot represent the same range of values as a sixteen-bit one. The quantization scheme has to decide which weights matter enough to deserve more bits, which can be rounded harder, and how to organise the storage to minimise error. The k-quant family (the schemes with _K in the name) does this with per-block scaling. The IQ family adds importance-aware weighting, often producing better quality at the same bit-rate.
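
To make "per-block scaling" concrete, here is a toy sketch in Python of the core idea: the weights are split into small blocks, each block stores one scale (and, here, an offset), and each weight is rounded to a 4-bit integer relative to that scale. This is an illustration only, not the actual GGUF k-quant layout, which also groups blocks into super-blocks and packs the integers more cleverly.

    import numpy as np

    def quantize_block_q4(block):
        # One scale and offset per block; each weight becomes a 4-bit integer (0..15).
        # Real k-quants add super-block scales and tighter packing; this only shows
        # why a per-block scale keeps the rounding error small.
        lo, hi = float(block.min()), float(block.max())
        scale = (hi - lo) / 15.0 if hi > lo else 1.0
        q = np.clip(np.round((block - lo) / scale), 0, 15).astype(np.uint8)
        return q, scale, lo

    def dequantize_block_q4(q, scale, lo):
        return q.astype(np.float32) * scale + lo

    block = np.random.randn(32).astype(np.float32)      # one 32-weight block
    q, scale, lo = quantize_block_q4(block)
    err = np.abs(block - dequantize_block_q4(q, scale, lo)).mean()
    print(f"mean rounding error for this block: {err:.4f}")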

The five quantization levels you should actually know

The GGUF catalogue lists more than thirty variants. The ones that matter for almost every decision are five.

Q4_K_M — the default

Roughly four bits per weight, with the most-used weights stored at slightly higher precision. The size reduction versus the original FP16 model is roughly 70 percent: a 14 GB FP16 model lands around 4 GB. Quality loss against the unquantised baseline is small enough that most users cannot reliably tell the difference in blind side-by-side tests on general chat tasks. If you do not know which quantization to pick, pick this one.
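
If you want to sanity-check file sizes like that yourself, a rough estimate is parameter count times bits per weight, divided by eight. The bits-per-weight figures below are ballpark values, not exact, because the real formats quantize different tensors at different precisions.

    # Rough GGUF size estimate; bpw values are approximations, and actual files
    # vary by a few percent depending on the model architecture.
    BPW = {"F16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
           "Q4_K_M": 4.8, "Q3_K_S": 3.5}

    def est_size_gb(params_billions, quant):
        return params_billions * BPW[quant] / 8

    print(est_size_gb(7, "F16"))     # ~14 GB, the FP16 baseline above
    print(est_size_gb(7, "Q4_K_M"))  # ~4.2 GB, the "around 4 GB" figure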

Q5_K_M — when you have headroom

About five bits per weight. The size grows by roughly 15 to 20 percent over Q4_K_M, and the quality improvement is real but modest. The difference shows most clearly on tasks that punish numerical instability: chained mathematical reasoning, code that involves precise arithmetic, multi-step logic puzzles. If your machine fits Q5_K_M without spilling out of VRAM, the upgrade is essentially free utility.

Q8_0 — the near-perfect tier

Eight bits per weight. The model size is half of FP16, and quality is statistically indistinguishable from the original in almost every evaluation. The trade-off is that the file is roughly twice as big as Q4_K_M, so the same physical memory holds half as many parameters. Useful when you have plenty of memory and want a defensible baseline for benchmarking or for production work where you cannot afford a quality regression you might not notice.

Q3_K_S — when you have to fit a bigger model

Three bits per weight, simple variant. Used when you want to fit a much larger model into a memory budget that its 4-bit version would overflow. A 70B model in Q3_K_S fits in roughly 32 GB instead of the 42 GB Q4_K_M needs. Quality drops are visible: more hallucination, worse code, sometimes confused chat turn-taking. The right answer is usually “run a smaller model at Q4_K_M instead,” but there are cases where the larger-model effect dominates the quantization noise.

IQ4_XS and friends — the modern alternative

The IQ family applies importance-aware quantization with smaller block sizes. IQ4_XS, in particular, has become a popular replacement for Q4_K_M because it produces models about ten percent smaller at similar quality. The cost is slower inference on some hardware because the decoding is more complex. On Apple Silicon and modern NVIDIA cards the speed difference is small; on older hardware it can be noticeable. Worth trying on a model you already know well, so you can judge the trade-off concretely.

The decision rule that actually works

Here is the rule that holds up across hardware and use cases. Compute how much VRAM you have for the model alone (total VRAM minus around 2 GB for context cache and overhead), then pick the largest quantization that fits with comfortable margin.
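
The rule is simple enough to write down. A minimal sketch, reusing the rough bits-per-weight estimates from earlier; the 2 GB overhead and the candidate list are assumptions for illustration, not fixed constants:

    # Pick the largest quantization whose estimated size fits the VRAM budget.
    # bpw values and the ~2 GB overhead are rough assumptions.
    CANDIDATES = [("Q8_0", 8.5), ("Q5_K_M", 5.7), ("Q4_K_M", 4.8), ("Q3_K_S", 3.5)]

    def pick_quant(total_vram_gb, params_billions, overhead_gb=2.0):
        budget = total_vram_gb - overhead_gb
        for name, bpw in CANDIDATES:           # ordered largest to smallest
            size_gb = params_billions * bpw / 8
            if size_gb <= budget:
                return name, round(size_gb, 1)
        return None, None                       # nothing fits: pick a smaller model

    print(pick_quant(24, 32))   # -> ('Q4_K_M', 19.2) on a 24 GB card
    print(pick_quant(16, 14))   # -> ('Q5_K_M', 10.0) on a 16 GB machine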

Two things the rule omits, and when they matter

First, context length. KV cache memory grows linearly with context size and is not quantized in the same way the weights are. A model that fits in 16 GB of VRAM at 4k context might overflow at 32k. If you plan to use long contexts, leave more headroom than the rule suggests, or look at quantized-cache options in recent llama.cpp builds (flash attention via -fa, combined with the --cache-type-k and --cache-type-v flags).
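
To see how quickly long contexts eat that headroom, you can estimate KV cache size from the model's layer count and attention dimensions. The dimensions in the example below are roughly the shape of an 8B model with grouped-query attention, chosen for illustration rather than taken from any specific architecture.

    def kv_cache_gb(n_layers, n_kv_heads, head_dim, context, bytes_per_elem=2):
        # Keys and values are both cached, hence the factor of 2 in front.
        # bytes_per_elem=2 is FP16; a 4-bit quantized cache roughly quarters this.
        return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem / 1e9

    # Illustrative dimensions, roughly an 8B-class model with grouped-query attention.
    print(kv_cache_gb(32, 8, 128, 4_096))    # ~0.5 GB at 4k context
    print(kv_cache_gb(32, 8, 128, 32_768))   # ~4.3 GB at 32k: the overflow case above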

Second, speculative decoding. If you pair a small draft model with a large target model, both have to fit in memory. The right quantization choice for the target may change once you account for the draft. The combined memory still has to leave room for the KV cache, and the draft model should usually be quantized only one or two bits below the target to keep its rejection rate sensible.
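
The same estimates extend to the speculative setup: the target weights, the draft weights, and a KV cache for each all have to fit at once. A hedged back-of-the-envelope check, with every figure an approximation:

    # Target + draft weights, plus a KV cache for each, plus scratch overhead.
    # All sizes are rough estimates in GB.
    def speculative_fits(vram_gb, target_gb, draft_gb,
                         target_kv_gb, draft_kv_gb, overhead_gb=1.0):
        total = target_gb + draft_gb + target_kv_gb + draft_kv_gb + overhead_gb
        return total <= vram_gb, round(total, 1)

    # 32B target at Q4_K_M (~19 GB) plus a 1B draft (~0.7 GB) on a 24 GB card.
    print(speculative_fits(24, 19.2, 0.7, 2.1, 0.2))

On these rough numbers the pair squeezes in at about 23 GB; stretch the context and it stops fitting, which is exactly the point at which the target has to drop a quantization level.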

A test you can run in twenty minutes

Pick a model you use often. Download three GGUF variants from the same uploader (so they share quantization tooling): Q4_K_M, Q5_K_M, Q8_0. Build a small prompt set that represents your actual workload: five prompts is enough. Run them through each model in LM Studio using multi-model chat, or via three terminal windows running llama-cli. Read the outputs side by side.
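
If you take the terminal route, a short script keeps the comparison honest by running identical prompts through each variant. The file names and prompts below are placeholders, and the flags assume a recent llama.cpp build with llama-cli on your PATH; some chat models default to interactive conversation mode, so closing stdin is a cheap safeguard.

    import subprocess

    # Placeholder file names: swap in the three GGUFs you actually downloaded.
    MODELS = ["model-Q4_K_M.gguf", "model-Q5_K_M.gguf", "model-Q8_0.gguf"]
    PROMPTS = [
        "Explain the difference between a mutex and a semaphore.",
        "What is 17 * 243 - 96 / 4? Show your working.",
    ]

    for model in MODELS:
        for prompt in PROMPTS:
            result = subprocess.run(
                ["llama-cli", "-m", model, "-p", prompt, "-n", "256"],
                capture_output=True, text=True, stdin=subprocess.DEVNULL,
            )
            print(f"--- {model} ---\n{result.stdout}\n")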

Most of the time, you will find that you cannot reliably distinguish Q4_K_M from Q8_0 on general chat. You will sometimes spot Q4 making a numerical error that Q8 gets right. On code tasks, the gap widens slightly. If you find a workload where Q4 produces visibly worse output, that is your signal to move up. Otherwise, stay where you are; the disk space and the speed are worth more than the theoretical precision.

Where the IQ family fits into your shelf

IQ quantizations are worth trying once you have a model you use every day, because the size savings compound across re-downloads and the quality at small sizes (IQ3_XXS, IQ2_S) is markedly better than the equivalent k-quants. For your daily-driver model, start with Q4_K_M, run the twenty-minute test against IQ4_XS, keep whichever wins on your prompts. For a model you use occasionally, do not bother; the time to evaluate outweighs the gain.

What this looks like in practice

A 24 GB GPU, the most common configuration above the hobbyist line, comfortably runs a 32B model at Q4_K_M or a 14B at Q8_0. The Q8 14B is, on most tasks, the better choice, because the precision dominates the parameter count benefit at that scale. A 16 GB Apple Silicon Mac handles an 8B at Q5_K_M with plenty of context, or a 14B at Q4_K_M with shorter contexts. A 12 GB consumer GPU runs an 8B at Q5_K_M well, or a 14B at Q4_K_M if you keep context modest.

The point is that the “right” quantization is almost never a single answer. It depends on the model size you are trying to fit, the context lengths you actually use, and the workload you care about. The decision rule above gets you to a sensible default; the twenty-minute test lets you correct it. Anything more elaborate is usually false precision.