🧠Compare GQA, MQA, and Multi-Head Attention
GQA isn't a new mechanism — it's a single knob (G) that trades KV-cache memory for quality on top of plain attention. You'll learn to pick G for a real serving budget by walking the cache-size math and the quality argument side by side.
Phase 1: Why the KV Cache Eats Your GPU
See why the KV cache, not FLOPs, is what limits long-context inference
FLOPs aren't the bottleneck — your KV cache is
7 min: Autoregressive decoding is memory-bandwidth bound, not compute bound. The KV cache, not the attention matmul, is what exhausts your GPU memory.
The whole formula fits on a Post-it
6 min: KV cache bytes = 2 · L · H · d_head · seq · batch · dtype_bytes, where the 2 counts K and V, L is the number of layers, and H is the number of heads that store K/V. Every attention variant just changes that H term.
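The formula above can be sketched as a single function. This is a minimal illustration, not a profiler: the function name, the parameter names, and the 32-layer model shape are assumptions chosen for the example, and the default of 2 bytes per element assumes an fp16 or bf16 cache.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, dtype_bytes=2):
    """KV cache size in bytes.

    The leading 2 counts one K tensor and one V tensor per layer.
    n_kv_heads is the H term: the number of heads that actually store K/V.
    """
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * dtype_bytes


# Hypothetical model shape, used only for illustration.
mha_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, seq_len=4096, batch=1)
print(f"{mha_bytes / 2**30:.2f} GiB")  # prints 2.00 GiB
```

Swapping `n_kv_heads` is the only change needed to compare attention variants under this sketch.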
Fewer KV heads is the cheapest cache fix
6 min: Of all the things you could change about attention, reducing the KV-head count gives the biggest cache cut per unit of quality risk.
GQA isn't a new mechanism — it's a slider
6 min: GQA is the same attention you already know. The only difference is one integer G, the number of KV groups, which slides from 1 (MQA) to H (MHA).
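To see the slider in action, here is a small sketch that lists the settings of G for a given H and the cache each one keeps relative to MHA. It assumes G must divide H evenly so every group has the same number of query heads, which is a common implementation constraint rather than something stated above. The function names and H = 32 are hypothetical.

```python
def valid_group_counts(n_q_heads):
    """Values of G that split n_q_heads into equal-sized groups (assumed constraint)."""
    return [g for g in range(1, n_q_heads + 1) if n_q_heads % g == 0]


def cache_fraction_of_mha(n_q_heads, n_groups):
    """Cache size relative to MHA: only G of the H heads store K/V."""
    return n_groups / n_q_heads


H = 32  # hypothetical query-head count
for g in valid_group_counts(H):
    label = "MQA" if g == 1 else "MHA" if g == H else f"GQA-{g}"
    print(f"G={g:>2} ({label}): {cache_fraction_of_mha(H, g):.1%} of the MHA cache")
```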
Phase 2: Drawing the Three Layouts
Draw MHA, MQA, and GQA head layouts from memory
Draw MHA before you draw anything else
6 min: Multi-head attention is H parallel attention heads, each with its own Q, K, and V projections. Every K/V is private. That's the baseline GQA modifies.
Draw MQA next — H queries, one KV
6 min: Multi-query attention keeps H Q heads but collapses K and V to a single shared head. The cache shrinks by exactly H× relative to MHA.
GQA sits in the middle — H queries, G KVs
7 min: GQA splits the H Q heads into G groups, and each group shares one K head and one V head. G is a free parameter from 1 (MQA) to H (MHA).
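The grouping can be drawn in code as a map from each Q head to the KV head it reads. This is a sketch under two assumptions: G divides H evenly, and groups are contiguous blocks of Q heads (a common choice, but other assignments are possible). The function name and the H = 8 example are hypothetical.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_groups):
    """Index of the shared KV head that a given Q head attends with.

    Assumes contiguous, equal-sized groups of Q heads.
    """
    if n_q_heads % n_groups != 0:
        raise ValueError("n_groups must divide n_q_heads evenly")
    group_size = n_q_heads // n_groups
    return q_head // group_size


H = 8  # hypothetical query-head count
mha_layout = [kv_head_for_query_head(q, H, H) for q in range(H)]  # [0, 1, 2, 3, 4, 5, 6, 7]
gqa_layout = [kv_head_for_query_head(q, H, 2) for q in range(H)]  # [0, 0, 0, 0, 1, 1, 1, 1]
mqa_layout = [kv_head_for_query_head(q, H, 1) for q in range(H)]  # [0, 0, 0, 0, 0, 0, 0, 0]
print(mha_layout, gqa_layout, mqa_layout, sep="\n")
```

With G = H every Q head gets a private KV head, and with G = 1 they all share one, which matches the two endpoints of the range.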
Compute cache size in your head, three variants
7 min: Practice the formula until cache size becomes mental arithmetic for any model spec, any context, any batch.
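One shortcut for the mental arithmetic is to compute bytes per token first, then multiply by seq · batch. The sketch below does that for the three variants. The shape (80 layers, 64 Q heads, d_head = 128, 2-byte cache) is an assumption resembling a 70B-class model, and the function name is hypothetical.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, d_head, dtype_bytes=2):
    """Cache bytes added by one token of one sequence: 2 (K and V) x L x KV heads x d_head x bytes."""
    return 2 * n_layers * n_kv_heads * d_head * dtype_bytes


# Assumed shape for illustration only.
N_LAYERS, N_Q_HEADS, D_HEAD = 80, 64, 128

per_token = {
    "MHA": kv_bytes_per_token(N_LAYERS, N_Q_HEADS, D_HEAD),  # 2560 KiB
    "GQA-8": kv_bytes_per_token(N_LAYERS, 8, D_HEAD),        # 320 KiB
    "MQA": kv_bytes_per_token(N_LAYERS, 1, D_HEAD),          # 40 KiB
}
for name, size in per_token.items():
    total_gib = size * 4096 / 2**30  # one sequence of 4096 tokens
    print(f"{name}: {size // 1024} KiB per token, {total_gib:.3f} GiB at 4096 tokens")
```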
Spot the GQA in any model's config file
6 min: Every modern open model declares its head layout in one or two config keys. Learn the names and you can read any architecture at a glance.
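As a sketch of what reading the layout looks like, the helper below classifies a config dictionary. It assumes the Hugging Face style key names `num_attention_heads` and `num_key_value_heads` used by Llama-family configs; other model families may use different keys, so treat the names as assumptions. The helper itself and the example dictionaries are hypothetical.

```python
def describe_layout(config):
    """Classify a head layout from a config dict.

    Assumes the keys num_attention_heads and num_key_value_heads.
    A missing KV-head key is treated as plain MHA.
    """
    n_q = config["num_attention_heads"]
    n_kv = config.get("num_key_value_heads", n_q)
    if n_kv == n_q:
        return "MHA"
    if n_kv == 1:
        return "MQA"
    return f"GQA-{n_kv}"


# Illustrative config fragments, not copied from any published file.
print(describe_layout({"num_attention_heads": 64, "num_key_value_heads": 8}))  # GQA-8
print(describe_layout({"num_attention_heads": 40}))                             # MHA
print(describe_layout({"num_attention_heads": 32, "num_key_value_heads": 1}))  # MQA
```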
Phase 3: Why Real Models Picked What They Did
Read why Llama, Mistral, and Falcon picked different G
Your team wants MQA — should you push back?
7 min
Meta picked G=8 for Llama-2 70B — why not 4? Why not 16?
7 min
Mistral 7B picked GQA-8 — and added a second knob
7 min
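The page does not name the second knob. Assuming it refers to Mistral 7B's sliding-window attention, the effect on the cache can be sketched as a cap on how many tokens are kept per sequence. The shape below (32 layers, 8 KV heads, d_head = 128, window of 4096 tokens, 2-byte cache) and both function names are assumptions for illustration.

```python
def cached_tokens(seq_len, window=None):
    """Tokens kept in the cache per sequence; a sliding window caps it (assumed rolling buffer)."""
    return seq_len if window is None else min(seq_len, window)


def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, dtype_bytes=2, window=None):
    """KV cache bytes with an optional sliding-window cap on cached tokens."""
    tokens = cached_tokens(seq_len, window)
    return 2 * n_layers * n_kv_heads * d_head * tokens * batch * dtype_bytes


# Assumed shape: 32 layers, 8 KV heads, d_head 128, one sequence of 32768 tokens.
full = kv_cache_bytes(32, 8, 128, seq_len=32768, batch=1)
windowed = kv_cache_bytes(32, 8, 128, seq_len=32768, batch=1, window=4096)
print(f"no window: {full / 2**30:.1f} GiB, window 4096: {windowed / 2**30:.1f} GiB")
```

Under these assumptions G shrinks the cache per token, while the window limits how many tokens are cached at all, so the two knobs multiply.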
Falcon-7B shipped MQA — how did that bet age?
7 min
Phase 4: Pick a G for Your 13B Serving Job
Pick a G for a 13B model on one 80GB GPU
Pick G for a 13B serving 64k context at batch 32 on one 80GB GPU
18 min
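A rough first pass at this exercise can be sketched as a budget check. Everything about the model is an assumption here: 40 layers, 40 Q heads, d_head = 128, 13e9 parameters stored at 2 bytes each, and 80 GB taken as 80 · 10^9 bytes. The sketch also ignores activations, fragmentation, and framework overhead, so a real deployment has less headroom than it suggests. The function names are hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, dtype_bytes):
    """KV cache bytes: 2 (K and V) x layers x KV heads x head dim x tokens x batch x bytes."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * dtype_bytes


def largest_fitting_g(gpu_bytes, weight_bytes, n_layers, n_q_heads, d_head,
                      seq_len, batch, cache_dtype_bytes):
    """Largest G dividing n_q_heads whose cache fits beside the weights, or None."""
    budget = gpu_bytes - weight_bytes
    fitting = [
        g for g in range(1, n_q_heads + 1)
        if n_q_heads % g == 0
        and kv_cache_bytes(n_layers, g, d_head, seq_len, batch, cache_dtype_bytes) <= budget
    ]
    return max(fitting) if fitting else None


# Assumed 13B-class shape and budget, for illustration only.
common = dict(gpu_bytes=80 * 10**9, weight_bytes=13 * 10**9 * 2,
              n_layers=40, n_q_heads=40, d_head=128, seq_len=65536, batch=32)

g_fp16_cache = largest_fitting_g(cache_dtype_bytes=2, **common)  # 1: only MQA fits
g_int8_cache = largest_fitting_g(cache_dtype_bytes=1, **common)  # 2
print(g_fp16_cache, g_int8_cache)
```

Under these assumptions each KV head costs about 43 GB at 2 bytes per element, so the budget forces G = 1 unless the cache is quantized, the batch shrinks, or the context is capped.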
Frequently asked questions
- What is the difference between GQA, MQA, and multi-head attention?
- Why does GQA reduce memory but not quality the way MQA does?
- How do I pick the number of KV groups (G) for grouped query attention?
- Why is the KV cache the bottleneck for LLM inference at long context?
- Which open-source models use GQA and what value of G did they choose?
Each of these questions is covered in the “Compare GQA, MQA, and Multi-Head Attention” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.