©️ Understand Copyright in AI Training Data
Content on the public web is not automatically 'fair to train on,' and not every scrape is theft. Walk the four real threads (what copyright covers, how fair use is being argued, what licensing actually looks like, and which opt-out signals matter), then outline a sourcing policy you'd defend.
Phase 1: What copyright protects and what training does
Copyright protects expression, not ideas, and that gap matters for AI (7 min)
What training mechanically does to a copyrighted work (8 min)
The four factors of fair use: the lens every case uses (8 min)
US fair use, EU TDM exceptions, Japan's explicit carve-out (8 min)
Phase 2: Map four lawsuits onto the fair-use factors
NYT v. OpenAI: when memorization becomes evidence (8 min)
Getty v. Stability AI: when watermarks survive training (7 min)
Andersen v. Stability AI: when the artists sue, not the corporations (7 min)
Authors Guild v. OpenAI: when novelists frame training as licensing-bypass (8 min)
Pattern across the four cases: what builders should actually take away (9 min)
Phase 3: Licensing, opt-outs, and provenance signals
Your CEO asks 'can we just license everything?' (8 min)
Your engineer says 'robots.txt is enough; we respect it' (8 min)
Provenance: C2PA, watermarking, and 'where did this come from?' (8 min)
Synthetic data and 'clean' training corpora: the new escape hatch? (8 min)
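As a concrete anchor for the robots.txt lesson in this phase, here is a minimal sketch of what "respecting robots.txt" can look like in code, using Python's standard `urllib.robotparser`. The robots.txt content, the crawler names `ExampleTrainingBot` and `ExampleSearchBot`, and the `may_fetch` helper are all hypothetical and exist only for illustration. The sketch shows what the file permits for a given user agent, and nothing more.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch).
# A real crawler would fetch this file from the site it intends to crawl.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleTrainingBot", "ExampleSearchBot"):
        for url in ("https://example.com/articles/1",
                    "https://example.com/private/x"):
            print(agent, url, may_fetch(SAMPLE_ROBOTS_TXT, agent, url))
```

In this sample, the hypothetical training crawler is blocked site-wide while other crawlers are blocked only from `/private/`. A check like this reports what one opt-out signal says for one user agent, which is why the lesson treats "we respect it" as a starting point rather than a full sourcing policy.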
Phase 4: Outline a defensible training-data sourcing policy
Outline a training-data sourcing policy for a hypothetical generative product (10 min)
Frequently asked questions
Each question below is covered in the “Understand Copyright in AI Training Data” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
- Is it legal to train an AI model on copyrighted material from the public web?
- How does the four-factor fair-use test apply to training data?
- What are the main AI training data lawsuits and what's actually being argued?
- How do robots.txt, ai.txt, and C2PA 'do-not-train' signals differ?
- What should a training-data sourcing policy actually contain?