🧠 Compare GQA, MQA, and Multi-Head Attention

GQA isn't a new mechanism — it's a single knob (G) that trades KV-cache memory for quality on top of plain attention. You'll learn to pick G for a real serving budget by walking the cache-size math and the quality argument side by side.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Why the KV Cache Eats Your GPU

See why the KV cache, not FLOPs, is what limits long-context inference

4 drops
  1. FLOPs aren't the bottleneck — your KV cache is

    7 min

    Autoregressive inference is memory-bandwidth bound, not compute bound. The KV cache, not the attention matmul, is what runs you out of GPU memory.

  2. The whole formula fits on a Post-it

    6 min

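    KV cache bytes = 2 · L · H · d_head · seq · batch · dtype_bytes, where the 2 counts K and V, L is the number of layers, and H is the number of KV heads. Every attention variant just changes the H term: H for MHA, G for GQA, 1 for MQA.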

  3. Fewer KV heads is the cheapest cache fix

    6 min

    Of all the things you could change about attention, reducing the KV-head count gives the biggest cache cut per unit of quality risk.

  4. GQA isn't a new mechanism — it's a slider

    6 min

    GQA is the same attention you already know. The only difference is one integer G — the number of KV groups — which slides from 1 (MQA) to H (MHA).
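
The Post-it formula from drop 2 can be sketched in a few lines of Python. The model spec below (32 layers, 32 query heads, d_head of 128, 4,096-token context, batch 1, 2-byte values) is a hypothetical example chosen for round numbers, not a spec taken from this path.

```python
def kv_cache_bytes(layers, kv_heads, d_head, seq, batch, dtype_bytes):
    """KV cache bytes = 2 * L * H * d_head * seq * batch * dtype_bytes."""
    # The leading 2 counts one K tensor and one V tensor per layer.
    return 2 * layers * kv_heads * d_head * seq * batch * dtype_bytes


# Hypothetical spec (an assumption for illustration only).
spec = dict(layers=32, d_head=128, seq=4096, batch=1, dtype_bytes=2)
H = 32  # number of query heads

mha = kv_cache_bytes(kv_heads=H, **spec)  # every query head has its own K/V
gqa = kv_cache_bytes(kv_heads=8, **spec)  # G = 8 KV groups
mqa = kv_cache_bytes(kv_heads=1, **spec)  # one shared K/V head

for name, size in [("MHA", mha), ("GQA-8", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.4f} GiB")
```

Only the `kv_heads` argument changes between the three calls, which is the sense in which G acts as a slider: under this spec MHA needs 2 GiB, G=8 needs a quarter of that, and MQA needs 1/32 of it.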

Phase 2: Drawing the Three Layouts

Draw MHA, MQA, and GQA head layouts from memory

5 drops
  1. Draw MHA before you draw anything else

    6 min

    Multi-head attention is H parallel attention heads, each with its own Q, K, and V projections. Every K/V is private. That's the baseline GQA modifies.

  2. Draw MQA next — H queries, one KV

    6 min

    Multi-query attention keeps H Q heads but collapses K and V to a single shared head. The cache shrinks by exactly H× relative to MHA.

  3. GQA sits in the middle — H queries, G KVs

    7 min

    GQA groups Q heads into G groups, each sharing one K and one V head. G is a free parameter from 1 (MQA) to H (MHA).

  4. Compute cache size in your head, three variants

    7 min

    Practice the formula until cache size becomes mental arithmetic for any model spec, any context, any batch.

  5. Spot the GQA in any model's config file

    6 min

    Every modern open model declares its head layout in one or two config keys. Learn the names and you can read any architecture at a glance.
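
The grouping from drop 3 and the config reading from drop 5 can be sketched together. Two assumptions are baked in here: query heads are assigned to KV groups in contiguous blocks, and the config uses the keys `num_attention_heads` and `num_key_value_heads`, as Hugging Face Llama-style and Mistral-style configs do. Other model families may name these keys differently. The helper names and the example dict are hypothetical.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_groups):
    """Return the index of the KV head that query head `q_head` reads."""
    if num_q_heads % num_kv_groups != 0:
        raise ValueError("G must divide H evenly")
    group_size = num_q_heads // num_kv_groups  # query heads per KV head
    return q_head // group_size  # contiguous grouping (an assumption)


def layout_name(config):
    """Classify a head layout from a config-style dict."""
    h = config["num_attention_heads"]
    # If the KV-head key is absent, assume one KV head per query head.
    g = config.get("num_key_value_heads", h)
    if g == h:
        return "MHA"
    if g == 1:
        return "MQA"
    return f"GQA-{g}"


# H = 8 query heads, G = 2 KV groups: heads 0-3 share KV 0, heads 4-7 share KV 1.
mapping = [kv_head_for_query_head(q, 8, 2) for q in range(8)]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]

# Hypothetical config dict, not copied from any real file.
print(layout_name({"num_attention_heads": 64, "num_key_value_heads": 8}))  # GQA-8
```

Setting G to 1 maps every query head to KV head 0 (MQA), and setting G to H maps each query head to its own KV head (MHA), so the same function draws all three layouts.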

Phase 3: Why Real Models Picked What They Did

Read why Llama, Mistral, and Falcon picked different values of G

4 drops
  1. Your team wants MQA — should you push back?

    7 min

  2. Meta picked G=8 for Llama-2 70B — why not 4? Why not 16?

    7 min

  3. Mistral 7B picked GQA-8 — and added a second knob

    7 min

  4. Falcon-7B shipped MQA — how did that bet age?

    7 min

Phase 4: Pick a G for Your 13B Serving Job

Pick a G for a 13B model on one 80GB GPU

1 drop
  1. Pick G for a 13B serving 64k context at batch 32 on one 80GB GPU

    18 min

    Pick G for a 13B serving 64k context at batch 32 on one 80GB GPU
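
A rough way to frame this sizing exercise is to compute the cache for every legal G and keep the largest one that fits. The sketch below is one possible setup, not the path's worked answer, and every number in it is an assumption: 40 layers, 40 query heads, and d_head of 128 for the 13B model; 2-byte weights (about 26 GB); 64k read as 64 × 1024 tokens; 80 GB read as 80e9 bytes; and nothing reserved for activations or allocator overhead.

```python
def kv_cache_bytes(layers, kv_heads, d_head, seq, batch, dtype_bytes):
    # 2 counts K and V; the rest is the per-layer, per-token footprint.
    return 2 * layers * kv_heads * d_head * seq * batch * dtype_bytes


def largest_fitting_g(q_heads, cache_budget_bytes, **spec):
    """Largest G that divides the query-head count and fits the budget."""
    candidates = [g for g in range(1, q_heads + 1) if q_heads % g == 0]
    fitting = [
        g for g in candidates
        if kv_cache_bytes(kv_heads=g, **spec) <= cache_budget_bytes
    ]
    return max(fitting) if fitting else None


# All values below are assumptions for illustration.
GPU_BYTES = 80e9
WEIGHT_BYTES = 13e9 * 2                    # 13B parameters at 2 bytes each
cache_budget = GPU_BYTES - WEIGHT_BYTES    # roughly 54e9 bytes left for the cache

spec = dict(layers=40, d_head=128, seq=64 * 1024, batch=32)

g_fp16 = largest_fitting_g(40, cache_budget, dtype_bytes=2, **spec)
g_8bit = largest_fitting_g(40, cache_budget, dtype_bytes=1, **spec)
print(g_fp16)  # 1
print(g_8bit)  # 2
```

Under these assumptions each KV head costs about 43 GB of 2-byte cache at 64k context and batch 32, so only G=1 fits beside the weights; halving the cache dtype to 1 byte makes room for G=2. Different layer counts, head sizes, or memory reserves would shift the result.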

Frequently asked questions

What is the difference between GQA, MQA, and multi-head attention?
Why does GQA cut memory without the quality loss that MQA incurs?
How do I pick the number of KV groups (G) for grouped query attention?
Why is the KV cache the bottleneck for LLM inference at long context?
Which open-source models use GQA and what value of G did they choose?
These questions are covered in the “Compare GQA, MQA, and Multi-Head Attention” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.