🖼️Understand Vision Transformers (ViT)
Walk one 224x224 image through patching, embedding, and attention until ViT stops feeling like a magic trick — then predict where the heads attend on a cat-and-person photo before the demo confirms it.
Phase 1: Why Vision Became Sequence Modeling
See why CNNs ceiling out on global context
CNNs see locally on purpose — and that's the bug at scale (6 min)
An image is worth 16x16 words, literally (6 min)
Without position embeddings, ViT can't tell top from bottom (6 min)
One token rules them all: the CLS token (6 min)
Phase 2: Patch, Embed, Attend — by Hand
Patch a 224x224 image and trace one attention layer
Slice the image like a contact sheet (7 min)
Project, then add — patch embedding plus position embedding (6 min)
Every patch asks every patch: 'how related are we?' (7 min)
Twelve heads, twelve different ways to read the image (6 min)
Stack twelve identical blocks and call it a day (7 min)
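The pipeline this phase walks through by hand — slice, flatten, project, add position embeddings, prepend the CLS token — can be traced as a shape exercise. A minimal NumPy sketch, assuming the ViT-Base setup the lessons use (16x16 patches, 768-wide embeddings); the random arrays stand in for learned weights:

```python
import numpy as np

# Hypothetical ViT-Base dimensions: 224x224 RGB image, 16x16 patches, d_model = 768.
image = np.random.rand(224, 224, 3)
patch, d_model = 16, 768

# Slice into non-overlapping 16x16 patches and flatten each one.
n_side = 224 // patch                                  # 14 patches per side
patches = image.reshape(n_side, patch, n_side, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)
assert patches.shape == (196, 768)                     # 196 tokens, 16*16*3 values each

# Linearly project each flattened patch (random stand-in for the learned projection).
W_proj = np.random.rand(patch * patch * 3, d_model) * 0.02
tokens = patches @ W_proj

# Prepend a CLS token (zeros here; learnable in practice), then add position embeddings.
cls = np.zeros((1, d_model))
tokens = np.concatenate([cls, tokens], axis=0)         # 197 tokens total
pos = np.random.rand(197, d_model) * 0.02              # learned in the real model
tokens = tokens + pos
print(tokens.shape)                                    # (197, 768)
```

From here the 197x768 matrix flows unchanged in shape through all twelve transformer blocks — which is why "stack twelve identical blocks" really is the whole story.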
Phase 3: ViT vs CNN — Data, Bias, and Hybrids
Weigh ViT vs CNN data hunger and inductive bias
A small ViT loses to a small CNN on a small dataset (7 min)
Choose your prior: built-in or learned (7 min)
Why Swin and ConvNeXt look like a synthesis, not a regression (7 min)
A one-page mental map for picking your next backbone (7 min)
Phase 4: Predict and Verify Attention on a Real Photo
Predict ViT attention on a cat-and-person photo
Sketch where ViT looks at a cat with a person — then check (18 min)
Frequently asked questions
- What does 'an image is worth 16x16 words' actually mean?
- It refers to the original ViT paper's core move: split the image into 16x16-pixel patches, flatten each patch into a vector, and feed the resulting sequence to a transformer exactly as a language model treats a sequence of words.
- Why do Vision Transformers need so much more training data than CNNs?
- Convolutions bake in locality and translation equivariance; ViT has no such built-in priors and must learn spatial regularities from data. On small datasets a CNN gets that structure for free, while a ViT tends to underfit it — which is why ViT shines only with large-scale pretraining.
- How do you turn a 224x224 image into tokens for a transformer?
- Slice it into 196 non-overlapping 16x16 patches (14 per side), flatten each patch, project it linearly to the model width, add a learned position embedding, and prepend a CLS token — yielding a sequence of 197 tokens.
- What is the CLS token in a Vision Transformer and what does it do?
- The CLS token is a learnable vector prepended to the patch sequence. It attends to every patch through the transformer layers, and its final output serves as the whole-image representation fed to the classification head.
- Why did hybrid architectures like Swin and ConvNeXt come back after ViT?
- They fold convolutional priors — locality, hierarchy, downsampling — back into or alongside attention, keeping ViT's global context while easing its data hunger: a synthesis of both lines of work rather than a retreat to CNNs.
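The "every patch asks every patch" step from Phase 2 is a single matrix computation. A minimal single-head self-attention sketch in NumPy, assuming the 197-token sequence built above and ViT-Base's 64-wide heads (768 / 12); the random weights are stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(197, 768))            # CLS + 196 patch tokens
d_head = 64                                     # per-head width: 768 / 12 heads

# Random stand-ins for the learned query/key/value projections of one head.
W_q = rng.normal(size=(768, d_head)) * 0.02
W_k = rng.normal(size=(768, d_head)) * 0.02
W_v = rng.normal(size=(768, d_head)) * 0.02

Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v
scores = Q @ K.T / np.sqrt(d_head)              # every token scores every token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the 197 keys
out = weights @ V                               # attention-weighted mix of values

assert weights.shape == (197, 197)              # the all-pairs "how related are we?" map
print(out.shape)                                # (197, 64)
```

The 197x197 `weights` matrix is exactly what the Phase 4 demo visualizes: row 0 (the CLS token's row) shows where the classifier "looks" on the cat-and-person photo.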
Related paths
🐍Python Decorators Introduction
Build one mental model for Python decorators that covers closures, argument passing, functools.wraps, and stacking — then ship a working caching or logging decorator from scratch in under 30 lines.
🦀Rust Lifetimes Explained
Stop reading `'a` as line noise and start reading it as scope arithmetic — one failing snippet at a time — until you can thread lifetimes through a small parser or iterator adapter without fighting the borrow checker.
☸️Kubernetes Core Concepts
Stop drowning in 30+ resource types. Build the mental model one primitive at a time -- pods, deployments, services, ingress, config -- then deploy a real app with rolling updates and health checks.
📈Big O Intuition
Stop treating Big O as math you memorized for an interview — build the intuition to spot O(n²) disasters, pick the right data structure without thinking, and rewrite a slow function from O(n²) to O(n) in under five minutes.