🧭 Understand Alignment as a Research Problem
Treat AI alignment as a research field with concrete open problems — outer vs inner alignment, deceptive alignment, and scalable oversight — instead of vibes about doom or guardrails. Walk away able to write a one-paragraph map of the alignment landscape that holds up to a skeptical reader.
Phase 1: What 'Alignment' Actually Means
See why specifying human values is itself unsolved
Alignment isn't 'AI safety' — it's the specification problem (6 min)
Specifying human values is the unsolved part (7 min)
A model can learn the wrong goal and still pass training (7 min)
Alignment used to be theoretical — capability made it concrete (6 min)
Phase 2: Outer, Inner, and Deceptive Alignment
Walk through outer, inner, and deceptive alignment with toy cases
Outer alignment: did you write down the right objective? (6 min)
Inner alignment: is the model pursuing the objective you trained? (7 min)
Walk through CoinRun: a tiny, complete inner-misalignment story (7 min)
Deceptive alignment: looks aligned in training, isn't (8 min)
Build a clean three-bucket threat model in your head (6 min)
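The CoinRun lesson in this phase has a shape that fits in a few lines of code. Here is a minimal, hypothetical sketch (a toy gridworld, not the actual CoinRun environment) of how a proxy policy can pass training and still pursue the wrong goal:

```python
# Toy gridworld sketch of inner misalignment (hypothetical; not the real
# CoinRun environment). During training the coin always sits at the far
# right, so "always move right" and "go to the coin" are indistinguishable.

def proxy_policy(pos, coin):
    """What the model actually learned: ignore the coin, just go right."""
    return "right"

def intended_policy(pos, coin):
    """What we wanted it to learn: head toward the coin."""
    return "down" if pos[0] < coin[0] else "right"

def run_episode(coin, policy, steps=10):
    """Agent starts at (0, 0); success means landing on the coin."""
    pos = (0, 0)
    for _ in range(steps):
        if pos == coin:
            return True
        move = policy(pos, coin)
        pos = (pos[0] + 1, pos[1]) if move == "down" else (pos[0], pos[1] + 1)
    return pos == coin

# Training distribution: coin at the rightmost cell of the top row.
# Both policies succeed, so training reward cannot tell them apart.
assert run_episode((0, 9), proxy_policy)
assert run_episode((0, 9), intended_policy)

# Deployment: coin moved off the rightward path. The proxy walks past it.
print(run_episode((1, 4), proxy_policy))     # False
print(run_episode((1, 4), intended_policy))  # True
```

The point of the toy: both policies earn identical reward during training, so the misaligned one is only exposed under distribution shift. That is the core of the inner-alignment worry.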
Phase 3: Scalable Oversight and Real-World Application
Map scalable oversight, debate, and weak-to-strong generalization
You're a senior reviewer and the model is smarter than you are. Now what? (7 min)
Two superhuman models argue. You judge. Does that work? (8 min)
You can only train with a weaker teacher. The student is more capable. Is the student actually aligned? (8 min)
An AI helps oversee another AI. Where does the trust bottom out? (8 min)
Phase 4: Map the Landscape in One Paragraph
Write a one-paragraph alignment landscape that survives critique
Write the alignment landscape in one paragraph that survives critique (15 min)
Frequently asked questions
- What is the AI alignment problem in plain language?
- In plain language: how do we get an AI system to pursue the goals we actually intend, when writing those goals down precisely is itself an unsolved problem? Phase 1 of this path breaks that into the specification problem, value specification, and models that learn the wrong goal yet still pass training.
- What is the difference between outer alignment and inner alignment?
- Outer alignment asks whether you wrote down the right objective; inner alignment asks whether the model is actually pursuing the objective you trained it on. Phase 2 walks through both with toy cases, including CoinRun.
- What does deceptive alignment mean and why is it hard to test for?
- Deceptive alignment means a model looks aligned during training but isn't, so good behavior under evaluation tells you little about behavior elsewhere. Phase 2 covers it and folds it into a clean three-bucket threat model.
- What is scalable oversight and why do we need it?
- Scalable oversight is the problem of supervising models that are more capable than their overseers. Phase 3 maps the main proposals, including debate between models, weak-to-strong generalization, and AI-assisted oversight, and asks where the trust bottoms out in each.
- Is AI alignment the same thing as AI safety?
- No. AI safety is the broader umbrella; alignment is specifically the specification problem: getting the objective right and getting the model to actually pursue it. The first lesson of Phase 1 draws this distinction.
Related paths
🐍 Python Decorators Introduction
Build one mental model for Python decorators that covers closures, argument passing, functools.wraps, and stacking — then ship a working caching or logging decorator from scratch in under 30 lines.
🦀 Rust Lifetimes Explained
Stop reading `'a` as line noise and start reading it as scope arithmetic — one failing snippet at a time — until you can thread lifetimes through a small parser or iterator adapter without fighting the borrow checker.
☸️ Kubernetes Core Concepts
Stop drowning in 30+ resource types. Build the mental model one primitive at a time — pods, deployments, services, ingress, config — then deploy a real app with rolling updates and health checks.
📈 Big O Intuition
Stop treating Big O as math you memorized for an interview — build the intuition to spot O(n²) disasters, pick the right data structure without thinking, and rewrite a slow function from O(n²) to O(n) in under five minutes.