What is the difference between GQA, MQA, and multi-head attention?

This is covered in the "Compare GQA, MQA, and Multi-Head Attention" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

Why does GQA reduce memory but not quality the way MQA does?

This is covered in the "Compare GQA, MQA, and Multi-Head Attention" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

How do I pick the number of KV groups (G) for grouped query attention?

This is covered in the "Compare GQA, MQA, and Multi-Head Attention" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

Why is the KV cache the bottleneck for LLM inference at long context?

This is covered in the "Compare GQA, MQA, and Multi-Head Attention" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

Which open-source models use GQA and what value of G did they choose?

This is covered in the "Compare GQA, MQA, and Multi-Head Attention" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.


🧠Compare GQA, MQA, and Multi-Head Attention

GQA isn't a new mechanism — it's a single knob (G) that trades KV-cache memory for quality on top of plain attention. You'll learn to pick G for a real serving budget by walking the cache-size math and the quality argument side by side.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Why the KV Cache Eats Your GPU

See why KV cache, not FLOPs, ends long-context inference

4 drops

FLOPs aren't the bottleneck — your KV cache is
7 min
Autoregressive inference is memory-bandwidth bound, not compute bound. The KV cache, not the attention matmul, is what runs you out of GPU memory.
The whole formula fits on a Post-it
6 min
KV cache bytes = 2 · L · H · d_head · seq · batch · dtype_bytes, where the 2 counts K and V, L is the number of layers, and H is the number of KV heads. Every attention variant is just changing the H term.
Fewer KV heads is the cheapest cache fix
6 min
Of all the things you could change about attention, reducing the KV-head count gives the biggest cache cut per unit of quality risk.
GQA isn't a new mechanism — it's a slider
6 min
GQA is the same attention you already know. The only difference is one integer G — the number of KV groups — which slides from 1 (MQA) to H (MHA).
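The cache formula from this phase can be sketched as a few lines of Python. This is a minimal illustration, not a serving-grade estimator: the model shape used below (32 layers, 32 query heads, d_head of 128, fp16, 4,096 tokens, batch 1) is a hypothetical spec chosen for round numbers, and the function name `kv_cache_bytes` is made up for this example.

```python
def kv_cache_bytes(layers, kv_heads, d_head, seq_len, batch, dtype_bytes=2):
    """KV cache size: 2 * L * H * d_head * seq * batch * dtype_bytes.

    The leading 2 counts K and V; kv_heads is the H term that
    MHA, GQA, and MQA each set differently.
    """
    return 2 * layers * kv_heads * d_head * seq_len * batch * dtype_bytes


# Hypothetical spec: 32 layers, 32 query heads, d_head 128, fp16 cache.
NUM_Q_HEADS = 32
for name, kv_heads in [("MHA", NUM_Q_HEADS), ("GQA-8", 8), ("MQA", 1)]:
    size = kv_cache_bytes(layers=32, kv_heads=kv_heads, d_head=128,
                          seq_len=4096, batch=1)
    print(f"{name:6s} {size / 2**30:.4f} GiB")
```

Only the `kv_heads` argument changes between the three variants, which is the sense in which G acts as a slider.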

Phase 2: Drawing the Three Layouts

Draw MHA, MQA, and GQA head layouts from memory

5 drops

Draw MHA before you draw anything else
6 min
Multi-head attention is H parallel attention heads, each with its own Q, K, and V projections. Every K/V is private. That's the baseline GQA modifies.
Draw MQA next — H queries, one KV
6 min
Multi-query attention keeps H Q heads but collapses K and V to a single shared head. The cache shrinks by exactly H× relative to MHA.
GQA sits in the middle — H queries, G KVs
7 min
GQA splits the H Q heads into G equal groups, each sharing one K and one V head. G is a free parameter from 1 (MQA) to H (MHA), as long as it divides H.
Compute cache size in your head, three variants
7 min
Practice the formula until cache size becomes mental arithmetic for any model spec, any context, any batch.
Spot the GQA in any model's config file
6 min
Every modern open model declares its head layout in one or two config keys. Learn the names and you can read any architecture at a glance.
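The three layouts in this phase differ only in which KV head each query head reads from. The sketch below prints that mapping for a toy model with 8 query heads. It assumes consecutive query heads share a KV head, which is a common convention but not the only possible one; the function names `kv_head_for` and `layout` are hypothetical.

```python
def kv_head_for(query_head, num_q_heads, num_kv_groups):
    """Index of the KV head that a given query head attends with.

    Assumes consecutive query heads are grouped together.
    """
    if num_q_heads % num_kv_groups != 0:
        raise ValueError("G must divide H")
    heads_per_group = num_q_heads // num_kv_groups
    return query_head // heads_per_group


def layout(num_q_heads, num_kv_groups):
    """KV head index for every query head, in order."""
    return [kv_head_for(h, num_q_heads, num_kv_groups)
            for h in range(num_q_heads)]


H = 8
print("MHA  ", layout(H, H))  # every query head has a private KV head
print("GQA-2", layout(H, 2))  # four query heads per KV head
print("MQA  ", layout(H, 1))  # all query heads share KV head 0
```

Reading the output left to right gives the same picture as the hand-drawn diagrams: the number of distinct values in each list is G.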

Phase 3: Why Real Models Picked What They Did

Read why Llama, Mistral, and Falcon picked different G

4 drops

Your team wants MQA — should you push back?
7 min
Meta picked G=8 for Llama-2 70B — why not 4? Why not 16?
7 min
Mistral 7B picked GQA-8 — and added a second knob
7 min
Falcon-7B shipped MQA — how did that bet age?
7 min
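To put the three models from this phase side by side, the sketch below classifies each head layout and computes KV cache bytes per token in fp16. The layer counts, head counts, and head dimensions are the commonly published values for these checkpoints and should be treated as assumptions to verify against each model's own config file (Llama and Mistral configs in Hugging Face format typically expose the counts as `num_attention_heads` and `num_key_value_heads`). The dictionary keys and function names here are hypothetical.

```python
# Commonly published shapes; verify against each model's config before relying on them.
MODELS = {
    "Llama-2 70B": {"layers": 80, "q_heads": 64, "kv_heads": 8, "d_head": 128},
    "Mistral 7B": {"layers": 32, "q_heads": 32, "kv_heads": 8, "d_head": 128},
    "Falcon-7B": {"layers": 32, "q_heads": 71, "kv_heads": 1, "d_head": 64},
}


def variant(q_heads, kv_heads):
    """Name the attention variant implied by the two head counts."""
    if kv_heads == q_heads:
        return "MHA"
    if kv_heads == 1:
        return "MQA"
    return f"GQA-{kv_heads}"


def kv_bytes_per_token(spec, dtype_bytes=2):
    """Cache bytes added per generated token: 2 * L * kv_heads * d_head * dtype_bytes."""
    return 2 * spec["layers"] * spec["kv_heads"] * spec["d_head"] * dtype_bytes


for name, spec in MODELS.items():
    label = variant(spec["q_heads"], spec["kv_heads"])
    print(f"{name:12s} {label:6s} {kv_bytes_per_token(spec) / 1024:.0f} KiB/token")
```

Per-token cost is a convenient unit for these comparisons because multiplying it by sequence length and batch size gives the full cache size.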

Phase 4: Pick a G for Your 13B Serving Job

Pick a G for a 13B model on one 80GB GPU

1 drop

Pick G for a 13B serving 64k context at batch 32 on one 80GB GPU
18 min
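One way to approach the capstone is to search over the valid values of G and keep the largest one whose cache fits in the memory left after weights. The sketch below does that under loudly stated assumptions: a hypothetical 13B shape of 40 layers, 40 query heads, and d_head of 128; fp16 weights at 2 bytes per parameter; 80 GB read as 80 × 10^9 bytes; and no allowance for activations or allocator overhead. The function names are made up for this example.

```python
def kv_cache_bytes(layers, kv_heads, d_head, seq_len, batch, dtype_bytes):
    """2 * L * H * d_head * seq * batch * dtype_bytes, with H = KV heads."""
    return 2 * layers * kv_heads * d_head * seq_len * batch * dtype_bytes


def largest_fitting_g(q_heads, layers, d_head, seq_len, batch,
                      dtype_bytes, budget_bytes):
    """Largest divisor G of q_heads whose KV cache fits the budget, else None."""
    best = None
    for g in range(1, q_heads + 1):
        if q_heads % g != 0:
            continue  # G must divide the query-head count
        size = kv_cache_bytes(layers, g, d_head, seq_len, batch, dtype_bytes)
        if size <= budget_bytes:
            best = g
    return best


GPU_BYTES = 80 * 10**9          # assumption: "80GB" means 80e9 bytes
WEIGHT_BYTES = 13 * 10**9 * 2   # assumption: 13B parameters in fp16
budget = GPU_BYTES - WEIGHT_BYTES  # ignores activations and overhead

shape = dict(q_heads=40, layers=40, d_head=128, seq_len=64 * 1024, batch=32)
fp16_g = largest_fitting_g(dtype_bytes=2, budget_bytes=budget, **shape)
fp8_g = largest_fitting_g(dtype_bytes=1, budget_bytes=budget, **shape)
print("largest G with fp16 cache:", fp16_g)
print("largest G with 1-byte cache:", fp8_g)
```

Under these assumptions each unit of G costs about 43 GB of fp16 cache, so the memory budget is very tight; a real decision would also weigh the quality argument and any headroom needed for activations.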

