🗺️Choose an LLM Deployment Topology
Stop choosing between 'just call OpenAI' and 'self-host on H100s' — there's a graduation ladder of four real LLM topologies spanning those extremes. By the end you can sketch a 12-month plan that survives 10x traffic growth.
Phase 1: Why Topology Beats 'API vs Self-Host'
See the four topologies most teams skip past
'Should we self-host?' is the wrong first question
6 min · The real question isn't API vs self-host — it's which endpoints have constraints (latency, cost, sovereignty) that a single-provider API cannot meet.
Three constraints actually drive topology: latency, cost, sovereignty
6 min · Almost every topology choice boils down to which of three constraints binds hardest: p95 latency, $/1M tokens, or where the data must live.
The four topologies in plain language
6 min · Single API → gateway → hybrid hot-path → full self-host is a graduation ladder, not a binary; each step adds one capability the previous lacks.
The three topology decisions teams regret most
7 min · Premature self-host, vendor lock-in via SDK leakage, and 'multi-region from day one' eat more engineering time than any actual scaling event.
Phase 2: The Latency-Cost 2x2 That Picks Your Topology
Map your endpoints to a latency-cost 2x2
Catalog every LLM call site in your app
7 min · You can't choose a topology until you've inventoried every endpoint, its volume, and which constraint binds — and most teams have never done this exercise.
Plot every endpoint on a latency-cost 2x2
7 min · Classify each endpoint as latency-sensitive (yes/no) and cost-sensitive (yes/no); each quadrant maps to a specific topology recommendation.
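To make the quadrant-to-topology mapping concrete, here is a minimal Python sketch. The endpoint names, thresholds, and quadrant labels are illustrative assumptions, not part of the lesson:

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    p95_budget_ms: int      # latency budget the product can tolerate
    monthly_spend_usd: int  # current API spend on this endpoint

# Hypothetical thresholds; tune them to your own product and budget.
LATENCY_SENSITIVE_MS = 1_000
COST_SENSITIVE_USD = 5_000

# One plausible quadrant -> topology mapping, consistent with the ladder
# in Phase 1; treat it as a starting point, not a prescription.
QUADRANT_TO_TOPOLOGY = {
    (False, False): "single API: no reason to add anything yet",
    (False, True):  "gateway routing to the cheapest capable model",
    (True,  False): "gateway with latency-based or regional routing",
    (True,  True):  "hybrid hot-path: self-host this route, API for the rest",
}

def classify(ep: Endpoint) -> str:
    latency_sensitive = ep.p95_budget_ms < LATENCY_SENSITIVE_MS
    cost_sensitive = ep.monthly_spend_usd > COST_SENSITIVE_USD
    return QUADRANT_TO_TOPOLOGY[(latency_sensitive, cost_sensitive)]

for ep in (Endpoint("autocomplete", 300, 40_000),
           Endpoint("nightly-summaries", 60_000, 2_000)):
    print(f"{ep.name}: {classify(ep)}")
```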
Calculate the break-even point for self-hosting
7 min · Self-hosting an open-weight model pays off when monthly API spend on a single endpoint exceeds roughly 3x the cost of one always-on GPU pod plus eval/ops time.
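A back-of-envelope version of that rule of thumb; every dollar figure below is a placeholder, and the formula is one reading of the heuristic:

```python
# The lesson's break-even heuristic as arithmetic. Plug in your own pod
# pricing and ops estimate; these numbers are illustrative assumptions.
GPU_POD_MONTHLY_USD = 3_000       # one always-on inference pod
EVAL_AND_OPS_MONTHLY_USD = 2_000  # eval harness upkeep, on-call, upgrades

def self_host_pays_off(endpoint_api_spend_usd: float) -> bool:
    """True when one endpoint's monthly API spend clears roughly 3x the
    pod cost plus eval/ops time (one reading of the lesson's heuristic)."""
    break_even = 3 * GPU_POD_MONTHLY_USD + EVAL_AND_OPS_MONTHLY_USD
    return endpoint_api_spend_usd > break_even

print(self_host_pays_off(8_000))   # False: stay on the API
print(self_host_pays_off(20_000))  # True: worth running the real numbers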
Measure p95 latency before you optimize it
7 min · Most teams optimize for the wrong latency number — they look at provider SLAs instead of measuring real end-to-end p95 from their own infrastructure.
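A small sketch of measuring p95 yourself, with a placeholder standing in for your real client call; the point is to time the whole round trip from your own infrastructure:

```python
import time

def timed_ms(call) -> float:
    """Wall-clock latency of one call, in milliseconds."""
    start = time.perf_counter()
    call()
    return (time.perf_counter() - start) * 1_000

def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile: small and dependency-free."""
    ordered = sorted(samples_ms)
    return ordered[max(0, round(0.95 * len(ordered)) - 1)]

def fake_llm_call():
    time.sleep(0.05)  # stand-in for your real provider call

# Issue the calls from the same region and network path as production
# traffic; that is the number your users feel, not the provider's SLA.
samples = [timed_ms(fake_llm_call) for _ in range(100)]
print(f"end-to-end p95: {p95(samples):.0f} ms")
```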
Catalog sovereignty constraints before they catalog you
7 min · Data residency and contractual restrictions on AI processing limit your topology options long before cost or latency does — and they're often invisible in engineering meetings.
Phase 3: Layering Gateways, Fallback, and Self-Host
Layer gateways, fallback, and self-host where they pay
When an LLM gateway pays for itself
7 min · A gateway is worth adding the moment you have a second provider or a reason to route by request type — not before.
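A minimal sketch of the routing idea, assuming hypothetical provider functions and route names; a real gateway (hand-rolled or off-the-shelf) adds auth, retries, and usage metering on top:

```python
from typing import Callable

Provider = Callable[[str], str]

def frontier_api(prompt: str) -> str:
    return "placeholder response from the quality-first provider"

def cheap_api(prompt: str) -> str:
    return "placeholder response from the cost-first provider"

ROUTES: dict[str, Provider] = {
    "chat": frontier_api,         # quality-sensitive traffic
    "classification": cheap_api,  # high-volume, cost-sensitive traffic
}

def gateway(request_type: str, prompt: str) -> str:
    # One choke point: routing rules, logging, and provider swaps live here,
    # so application code never imports a provider SDK directly (the "SDK
    # leakage" lock-in an earlier lesson warns about).
    return ROUTES[request_type](prompt)

print(gateway("classification", "Is this ticket a bug report?"))
```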
Regional fallback: cheap insurance against provider outages
7 min · A second provider configured as fallback turns 4-hour outages into 30-second blips, and costs almost nothing until the day it saves you.
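One possible shape of that fallback logic, with placeholder provider callables; real code would catch only network/timeout/5xx errors and emit a metric on each failover:

```python
from typing import Callable

def call_with_fallback(prompt: str,
                       providers: list[Callable[[str], str]]) -> str:
    """Try each provider in order; return the first success."""
    last_error: Exception | None = None
    for provider in providers:
        try:
            return provider(prompt)
        except Exception as err:  # narrow to network/timeout/5xx in real code
            last_error = err      # log and emit a failover metric here
    raise RuntimeError("all providers failed") from last_error

# Usage: call_with_fallback(prompt, [primary_provider, secondary_provider])
```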
Hybrid hot-path self-host: when one route earns its own GPU
8 min · Self-hosting one high-volume route while keeping everything else on the API is the most cost-effective topology for apps in the $20K-$200K/month spend range.
Edge inference: where it pays and where it's premature
7 min · Edge inference shines for sub-200ms latency-critical use cases on small models, and is overkill for anything else — including most chat applications.
Phase 4: Sketching a 12-Month Plan for 10x Growth
Sketch a 12-month plan for 10x traffic growth
Sketch a 12-month topology plan for 10x traffic growth
10 min · A good 12-month topology plan names where you start, what triggers each step up the ladder, and what you explicitly choose NOT to build.
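One way to write such a plan down is as data, with an explicit trigger per step, so the topology upgrades on evidence rather than on a calendar. The triggers and steps below are illustrative assumptions, not a template:

```python
# A 12-month topology plan as a checklist: start state, trigger-gated
# upgrades up the ladder, and an explicit not-building list.
PLAN = [
    {"now": "single API provider, all endpoints"},
    {"trigger": "second provider needed, or routing by request type",
     "step": "add a gateway"},
    {"trigger": "a provider outage that users actually feel",
     "step": "configure a fallback provider"},
    {"trigger": "one endpoint's API spend clears the self-host break-even",
     "step": "self-host that hot path, keep the rest on the API"},
    {"not_building": ["full self-host", "multi-region active-active"]},
]
```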
Frequently asked questions
- When does self-hosting an open-weight LLM actually beat the API?
- Roughly when monthly API spend on that single endpoint exceeds about 3x the cost of one always-on GPU pod plus the eval and ops time to run it. Phase 2 walks through the break-even math.
- What is an LLM gateway and when do I need one?
- A gateway is a routing layer between your application and your model providers. It pays for itself the moment you have a second provider or a reason to route by request type, and not before; Phase 3 covers when to add one.
- How do I plan for 10x traffic growth without re-architecting?
- By writing a 12-month topology plan that names where you start, what triggers each step up the ladder, and what you explicitly choose not to build. That plan is the capstone exercise in Phase 4.
- Is multi-provider routing worth the operational complexity?
- Usually yes, as cheap insurance: a second provider configured as fallback turns 4-hour outages into 30-second blips and costs almost nothing until the day it saves you. Phase 3 walks through the setup.
- Where does edge inference fit in an LLM deployment plan?
- At the latency extreme: edge inference shines for sub-200ms, latency-critical use cases on small models and is premature for everything else, including most chat applications. Phase 3 covers where it pays.
Related paths
🐍Python Decorators Introduction
Build one mental model for Python decorators that covers closures, argument passing, functools.wraps, and stacking — then ship a working caching or logging decorator from scratch in under 30 lines.
🦀Rust Lifetimes Explained
Stop reading `'a` as line noise and start reading it as scope arithmetic — one failing snippet at a time — until you can thread lifetimes through a small parser or iterator adapter without fighting the borrow checker.
☸️Kubernetes Core Concepts
Stop drowning in 30+ resource types. Build the mental model one primitive at a time -- pods, deployments, services, ingress, config -- then deploy a real app with rolling updates and health checks.
📈Big O Intuition
Stop treating Big O as math you memorized for an interview — build the intuition to spot O(n²) disasters, pick the right data structure without thinking, and rewrite a slow function from O(n²) to O(n) in under five minutes.