Understand vLLM PagedAttention and KV Cache Memory
Re-use the virtual-memory analogy you already know to demystify vLLM: by the end you can sketch a block table, explain prefix sharing, and estimate how many 8k-context sequences fit on your GPU.
Phase 1: The Hidden Memory Bug in LLM Serving
See where 60-80% of KV cache memory vanishes
PagedAttention is paging, not attention math (5 min)
Pre-vLLM systems waste 60-80% of KV cache memory (6 min)
The KV cache is a heap; paging is a malloc rewrite (6 min)
A block is 16 tokens of K and V, fixed size (5 min)
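To make the fixed block size concrete, here is a minimal sketch of how many bytes one 16-token block of K and V might occupy. The model shape (32 layers, 8 KV heads, head dimension 128, FP16) and the function names are illustrative assumptions, not values taken from this path or from vLLM's API.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token needs across all layers.

    The leading factor of 2 accounts for storing both K and V.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_bytes_per_block(block_size_tokens, num_layers, num_kv_heads,
                       head_dim, dtype_bytes=2):
    """Bytes for one fixed-size block holding `block_size_tokens` tokens."""
    return block_size_tokens * kv_bytes_per_token(
        num_layers, num_kv_heads, head_dim, dtype_bytes
    )


# Hypothetical model shape, chosen only for illustration.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
per_block = kv_bytes_per_block(16, num_layers=32, num_kv_heads=8, head_dim=128)

print(per_token // 1024, "KiB per token")        # 128 KiB per token
print(per_block // (1024 * 1024), "MiB per block")  # 2 MiB per block
```

Because every block has the same size, the allocator only ever hands out or reclaims whole blocks, which is what makes the paging analogy work.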
Phase 2: Walking Through the Block Table
Build the block table and trace copy-on-write
The block table maps logical to physical, one row per sequence (6 min)
Sequences grow one block at a time, not in chunks (6 min)
Two requests with the same system prompt share blocks (7 min)
Divergence triggers copy-on-write at block granularity (7 min)
When VRAM fills up, vLLM swaps blocks to CPU (7 min)
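The block table, prefix sharing, and copy-on-write ideas in this phase can be sketched with a toy allocator. This is a simplified model under stated assumptions: `BlockAllocator`, its methods, and the 8-block pool are hypothetical names invented for illustration, and the code does not reproduce vLLM's actual internals.

```python
class BlockAllocator:
    """Toy physical-block pool with reference counts (hypothetical, not vLLM's API)."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # physical block ids not in use
        self.refcount = {}                   # physical block id -> number of owners

    def allocate(self):
        block = self.free.pop(0)
        self.refcount[block] = 1
        return block

    def share(self, block):
        """Let another sequence point at an existing physical block."""
        self.refcount[block] += 1
        return block

    def copy_on_write(self, block):
        """Return a block that is safe to write to.

        An unshared block is returned as-is; a shared block loses one
        reference and the writer gets a fresh block instead.
        """
        if self.refcount[block] == 1:
            return block
        self.refcount[block] -= 1
        return self.allocate()


alloc = BlockAllocator(num_blocks=8)

# Sequence A holds a 40-token prompt: ceil(40 / 16) = 3 blocks.
# Its block table maps logical blocks 0..2 to physical blocks.
table_a = [alloc.allocate() for _ in range(3)]

# Sequence B starts from the same prompt, so its table points at the same blocks.
table_b = [alloc.share(b) for b in table_a]

# B writes a new token into the partially filled last block and diverges from A,
# so only that one block is copied.
table_b[-1] = alloc.copy_on_write(table_b[-1])

print(table_a)         # [0, 1, 2]
print(table_b)         # [0, 1, 3]
print(alloc.refcount)  # {0: 2, 1: 2, 2: 1, 3: 1}
```

Only the diverging block is duplicated; the first two blocks stay shared, which is the point of doing copy-on-write at block granularity.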
Phase 3: Paging Meets Batching and Prefix Caching
Connect paging to batching and prefix caching
Your latency dashboard shows variance, not improvement (7 min)
Your chatbot's TTFT mysteriously dropped overnight (8 min)
Why a 128k model still chokes at 100 users (8 min)
INT8 KV cache halves your memory bill, not your block count (8 min)
Phase 4: Sizing Your GPU for Real Workloads
Estimate concurrent sequences for your real GPU
Estimate concurrent 8k-context sequences for your GPU (18 min)
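As a rough preview of the sizing exercise, the estimate can be sketched as back-of-the-envelope arithmetic. Every number below (an 80 GiB GPU, 16 GiB of weights, a 0.9 usable-memory fraction, and a 32-layer model with 8 KV heads of dimension 128 in FP16) is an assumption for illustration, and `max_concurrent_sequences` is a hypothetical helper, not a vLLM function. Real deployments also lose memory to activations and runtime overhead, so treat the result as an upper bound.

```python
import math


def max_concurrent_sequences(gpu_mem_gib, weights_gib, context_tokens,
                             num_layers, num_kv_heads, head_dim,
                             dtype_bytes=2, block_size=16,
                             mem_fraction=0.9, overhead_gib=0.0):
    """Upper-bound estimate of full-length sequences that fit in the KV cache."""
    # Memory left for KV cache after weights and any other overhead.
    kv_budget_bytes = (gpu_mem_gib * mem_fraction - weights_gib - overhead_gib) * 2**30

    # One block stores K and V (factor 2) for `block_size` tokens in every layer.
    block_bytes = block_size * 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

    total_blocks = int(kv_budget_bytes // block_bytes)
    blocks_per_seq = math.ceil(context_tokens / block_size)
    return total_blocks // blocks_per_seq


# Hypothetical setup: 80 GiB GPU, 16 GiB of FP16 weights, 8k-token contexts.
estimate = max_concurrent_sequences(
    gpu_mem_gib=80, weights_gib=16, context_tokens=8192,
    num_layers=32, num_kv_heads=8, head_dim=128,
)
print(estimate)  # 56
```

With these assumptions the KV budget is 56 GiB, each block is 2 MiB, and each 8k-token sequence needs 512 blocks, giving 28672 / 512 = 56 sequences if every one runs to full length.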
Frequently asked questions
- What is PagedAttention in vLLM and how is it different from FlashAttention?
- This is covered in the "Understand vLLM PagedAttention and KV Cache Memory" learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
- Why does the KV cache waste so much GPU memory without paging?
- This is covered in the "Understand vLLM PagedAttention and KV Cache Memory" learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
- How does vLLM share a system prompt across multiple requests?
- This is covered in the "Understand vLLM PagedAttention and KV Cache Memory" learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
- What is a block table in vLLM and how does it map logical to physical blocks?
- This is covered in the "Understand vLLM PagedAttention and KV Cache Memory" learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
- How do I estimate the maximum concurrent sequences my GPU can serve?
- This is covered in the "Understand vLLM PagedAttention and KV Cache Memory" learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
Related paths
Python Decorators Introduction
Build one mental model for Python decorators that covers closures, argument passing, functools.wraps, and stacking, then ship a working caching or logging decorator from scratch in under 30 lines.
Rust Lifetimes Explained
Stop reading `'a` as line noise and start reading it as scope arithmetic, one failing snippet at a time, until you can thread lifetimes through a small parser or iterator adapter without fighting the borrow checker.
Kubernetes Core Concepts
Stop drowning in 30+ resource types. Build the mental model one primitive at a time -- pods, deployments, services, ingress, config -- then deploy a real app with rolling updates and health checks.
Big O Intuition
Stop treating Big O as math you memorized for an interview; build the intuition to spot O(n²) disasters, pick the right data structure without thinking, and rewrite a slow function from O(n²) to O(n) in under five minutes.