🧮 Choose a Quantization Format: GPTQ vs AWQ vs EXL2 vs GGUF

Stop picking quantization formats from Reddit threads. You'll separate algorithm, file format, and runtime kernel into three clean decisions — then justify any pick for Ollama, vLLM, or a single 4090.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: The Three-Layer Stack Hidden Inside Every Quant

Untangle algorithm, file format, and runtime kernel

4 drops
  1. A quant format is three decisions, not one

    7 min

  2. 4-bit weights changed local LLM economics

    6 min

  3. GPTQ rounds carefully. AWQ rounds the important weights.

    7 min

  4. GGUF is a container, not an algorithm

    7 min

Phase 2: Running the Same Model Three Different Ways

Benchmark GGUF, AWQ, and EXL2 on identical prompts

5 drops
  1. Three runtimes, one model — set the experiment

    6 min

  2. Run Llama-3-8B in GGUF Q4_K_M with llama.cpp

    8 min

  3. Run the same model AWQ-quantized on vLLM

    8 min

  4. Run the same model EXL2 on ExLlamaV2

    8 min

  5. Three rows in your table — what they actually mean

    7 min

Phase 3: Why Each Format Wins Its Battle

Trace why each format wins different battles

4 drops
  1. The Mac developer who thought AWQ would help

    7 min

  2. The H100 fleet that should not run GGUF

    8 min

  3. The 4090 user who needs Llama-70B to fit

    8 min

  4. The model card that says AWQ — but ships Marlin-ready bytes

    8 min

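The 4090 drop in this phase turns on simple arithmetic: weight memory scales with parameter count times bits per weight. Here is a minimal sketch of that calculation, assuming exactly 70 billion parameters and a 24 GiB card, and counting weights only (KV cache, activations, and quantization scale overhead are ignored). `weight_gib` and `fits_in_vram` are hypothetical helper names, and 2.4 bits per weight is only an illustrative fractional rate of the kind EXL2 allows.

```python
# Rough weight-memory estimate for a quantized model.
# Assumptions: weights only; no KV cache, activations, or scale/zero-point overhead.

GIB = 2**30


def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / GIB


def fits_in_vram(params_billion: float, bits_per_weight: float, vram_gib: float) -> bool:
    """True if the weights alone are smaller than the available VRAM."""
    return weight_gib(params_billion, bits_per_weight) < vram_gib


if __name__ == "__main__":
    VRAM_GIB = 24.0  # assumed capacity of a single RTX 4090
    for bpw in (16.0, 4.0, 2.4):
        size = weight_gib(70, bpw)
        verdict = "fits" if fits_in_vram(70, bpw, VRAM_GIB) else "does not fit"
        print(f"70B at {bpw} bpw: {size:.1f} GiB of weights, {verdict} in {VRAM_GIB:.0f} GiB")
```

Under these assumptions a 4-bit 70B model overshoots a single 24 GiB card on weights alone, which is why the fractional bit rates in that drop matter. A real deployment also needs headroom for context, so treat the result as a lower bound.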

Phase 4: Recommending Stacks You Can Defend

Recommend and justify formats for real deployments

1 drop
  1. Build your three-deployment recommendation memo

    8 min

Frequently asked questions

What's the actual difference between GPTQ, AWQ, EXL2, and GGUF?
Which quantization format is fastest on an RTX 4090?
Can GGUF run on GPU or is it CPU-only?
Why is Marlin-AWQ so much faster than naive AWQ?
Should I use Q4_K_M or Q5_K_M for local Llama-3-8B?

Each of these questions is covered in the “Choose a Quantization Format: GPTQ vs AWQ vs EXL2 vs GGUF” learning path, through daily 5–8 minute micro-lessons that build from fundamentals to hands-on application.