🗺️Choose an LLM Deployment Topology
Stop choosing between 'just call OpenAI' and 'self-host on H100s' — there's a graduation ladder of four real LLM topologies spanning those extremes. By the end you can sketch a 12-month plan that survives 10x traffic growth.
Phase 1: Why Topology Beats 'API vs Self-Host'
See the four topologies most teams skip past
'Should we self-host?' is the wrong first question
6 min · The real question isn't API vs self-host — it's which endpoints have constraints (latency, cost, sovereignty) that a single-provider API cannot meet.
Three constraints actually drive topology: latency, cost, sovereignty
6 min · Almost every topology choice boils down to which of three constraints binds hardest: p95 latency, $/1M tokens, or where the data must live.
The four topologies in plain language
6 min · Single API → gateway → hybrid hot-path → full self-host is a graduation ladder, not a binary; each step adds one capability the previous lacks.
The three topology decisions teams regret most
7 min · Premature self-host, vendor lock-in via SDK leakage, and 'multi-region from day one' eat more engineering time than any actual scaling event.
Phase 2: The Latency-Cost 2x2 That Picks Your Topology
Map your endpoints to a latency-cost 2x2
Catalog every LLM call site in your app
7 min · You can't choose a topology until you've inventoried every endpoint, its volume, and which constraint binds — and most teams have never done this exercise.
Plot every endpoint on a latency-cost 2x2
7 min · Classify each endpoint as latency-sensitive (yes/no) and cost-sensitive (yes/no); each quadrant maps to a specific topology recommendation.
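To make the quadrant-to-topology mapping concrete, here is a minimal Python sketch. The endpoint names, thresholds, and quadrant labels are illustrative assumptions, not part of the lesson:

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    p95_budget_ms: int      # latency budget the product can tolerate
    monthly_spend_usd: int  # current API spend on this endpoint

# Hypothetical thresholds; tune them to your own product and budget.
LATENCY_SENSITIVE_MS = 1_000
COST_SENSITIVE_USD = 5_000

# One plausible quadrant -> topology mapping, consistent with the ladder
# in Phase 1; treat it as a starting point, not a prescription.
QUADRANT_TO_TOPOLOGY = {
    (False, False): "single API: no reason to add anything yet",
    (False, True):  "gateway routing to the cheapest capable model",
    (True,  False): "gateway with latency-based or regional routing",
    (True,  True):  "hybrid hot-path: self-host this route, API for the rest",
}

def classify(ep: Endpoint) -> str:
    latency_sensitive = ep.p95_budget_ms < LATENCY_SENSITIVE_MS
    cost_sensitive = ep.monthly_spend_usd > COST_SENSITIVE_USD
    return QUADRANT_TO_TOPOLOGY[(latency_sensitive, cost_sensitive)]

for ep in (Endpoint("autocomplete", 300, 40_000),
           Endpoint("nightly-summaries", 60_000, 2_000)):
    print(f"{ep.name}: {classify(ep)}")
```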
Calculate the break-even point for self-hosting
7 min · Self-hosting an open-weight model pays off when monthly API spend on a single endpoint exceeds roughly 3x the cost of one always-on GPU pod plus eval/ops time.
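A back-of-envelope version of that rule of thumb; every dollar figure below is a placeholder, and the formula is one reading of the heuristic:

```python
# The lesson's break-even heuristic as arithmetic. Plug in your own pod
# pricing and ops estimate; these numbers are illustrative assumptions.
GPU_POD_MONTHLY_USD = 3_000       # one always-on inference pod
EVAL_AND_OPS_MONTHLY_USD = 2_000  # eval harness upkeep, on-call, upgrades

def self_host_pays_off(endpoint_api_spend_usd: float) -> bool:
    """True when one endpoint's monthly API spend clears roughly 3x the
    pod cost plus eval/ops time (one reading of the lesson's heuristic)."""
    break_even = 3 * GPU_POD_MONTHLY_USD + EVAL_AND_OPS_MONTHLY_USD
    return endpoint_api_spend_usd > break_even

print(self_host_pays_off(8_000))   # False: stay on the API
print(self_host_pays_off(20_000))  # True: worth running the real numbers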
Measure p95 latency before you optimize it
7 min · Most teams optimize for the wrong latency number — they look at provider SLAs instead of measuring real end-to-end p95 from their own infrastructure.
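A small sketch of measuring p95 yourself, with a placeholder standing in for your real client call; the point is to time the whole round trip from your own infrastructure:

```python
import time

def timed_ms(call) -> float:
    """Wall-clock latency of one call, in milliseconds."""
    start = time.perf_counter()
    call()
    return (time.perf_counter() - start) * 1_000

def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile: small and dependency-free."""
    ordered = sorted(samples_ms)
    return ordered[max(0, round(0.95 * len(ordered)) - 1)]

def fake_llm_call():
    time.sleep(0.05)  # stand-in for your real provider call

# Issue the calls from the same region and network path as production
# traffic; that is the number your users feel, not the provider's SLA.
samples = [timed_ms(fake_llm_call) for _ in range(100)]
print(f"end-to-end p95: {p95(samples):.0f} ms")
```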
Catalog sovereignty constraints before they catalog you
7 min · Data residency and contractual restrictions on AI processing limit your topology options long before cost or latency does — and they're often invisible in engineering meetings.
Phase 3: Layering Gateways, Fallback, and Self-Host
Layer gateways, fallback, and self-host where they pay
When an LLM gateway pays for itself
7 min · A gateway is worth adding the moment you have a second provider or a reason to route by request type — not before.
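A minimal sketch of the routing idea, assuming hypothetical provider functions and route names; a real gateway (hand-rolled or off-the-shelf) adds auth, retries, and usage metering on top:

```python
from typing import Callable

Provider = Callable[[str], str]

def frontier_api(prompt: str) -> str:
    return "placeholder response from the quality-first provider"

def cheap_api(prompt: str) -> str:
    return "placeholder response from the cost-first provider"

ROUTES: dict[str, Provider] = {
    "chat": frontier_api,         # quality-sensitive traffic
    "classification": cheap_api,  # high-volume, cost-sensitive traffic
}

def gateway(request_type: str, prompt: str) -> str:
    # One choke point: routing rules, logging, and provider swaps live here,
    # so application code never imports a provider SDK directly (the "SDK
    # leakage" lock-in an earlier lesson warns about).
    return ROUTES[request_type](prompt)

print(gateway("classification", "Is this ticket a bug report?"))
```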
Regional fallback: cheap insurance against provider outages
7 min · A second provider configured as fallback turns 4-hour outages into 30-second blips, and costs almost nothing until the day it saves you.
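One possible shape of that fallback logic, with placeholder provider callables; real code would catch only network/timeout/5xx errors and emit a metric on each failover:

```python
from typing import Callable

def call_with_fallback(prompt: str,
                       providers: list[Callable[[str], str]]) -> str:
    """Try each provider in order; return the first success."""
    last_error: Exception | None = None
    for provider in providers:
        try:
            return provider(prompt)
        except Exception as err:  # narrow to network/timeout/5xx in real code
            last_error = err      # log and emit a failover metric here
    raise RuntimeError("all providers failed") from last_error

# Usage: call_with_fallback(prompt, [primary_provider, secondary_provider])
```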
Hybrid hot-path self-host: when one route earns its own GPU
8 min · Self-hosting one high-volume route while keeping everything else on the API is the most cost-effective topology for apps in the $20K-$200K/month spend range.
Edge inference: where it pays and where it's premature
7 min · Edge inference shines for sub-200ms latency-critical use cases on small models, and is overkill for anything else — including most chat applications.
Phase 4: Sketching a 12-Month Plan for 10x Growth
Sketch a 12-month plan for 10x traffic growth
Sketch a 12-month topology plan for 10x traffic growth
10 min · A good 12-month topology plan names where you start, what triggers each step up the ladder, and what you explicitly choose NOT to build.
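One way to write such a plan down is as data, with an explicit trigger per step, so the topology upgrades on evidence rather than on a calendar. The triggers and steps below are illustrative assumptions, not a template:

```python
# A 12-month topology plan as a checklist: start state, trigger-gated
# upgrades up the ladder, and an explicit not-building list.
PLAN = [
    {"now": "single API provider, all endpoints"},
    {"trigger": "second provider needed, or routing by request type",
     "step": "add a gateway"},
    {"trigger": "a provider outage that users actually feel",
     "step": "configure a fallback provider"},
    {"trigger": "one endpoint's API spend clears the self-host break-even",
     "step": "self-host that hot path, keep the rest on the API"},
    {"not_building": ["full self-host", "multi-region active-active"]},
]
```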
Frequently asked questions
- When does self-hosting an open-weight LLM actually beat the API?
- Roughly when monthly API spend on that single endpoint exceeds about 3x the cost of one always-on GPU pod plus the eval and ops time to run it. Phase 2 walks through the break-even math.
- What is an LLM gateway and when do I need one?
- A gateway is a routing layer between your application and your model providers. It pays for itself the moment you have a second provider or a reason to route by request type, and not before; Phase 3 covers when to add one.
- How do I plan for 10x traffic growth without re-architecting?
- By writing a 12-month topology plan that names where you start, what triggers each step up the ladder, and what you explicitly choose not to build. That plan is the capstone exercise in Phase 4.
- Is multi-provider routing worth the operational complexity?
- Usually yes, as cheap insurance: a second provider configured as fallback turns 4-hour outages into 30-second blips and costs almost nothing until the day it saves you. Phase 3 walks through the setup.
- Where does edge inference fit in an LLM deployment plan?
- At the latency extreme: edge inference shines for sub-200ms, latency-critical use cases on small models and is premature for everything else, including most chat applications. Phase 3 covers where it pays.
Related paths
🐍Python Decorators Introduction
Build one mental model for Python decorators that covers closures, argument passing, functools.wraps, and stacking — then ship a working caching or logging decorator from scratch in under 30 lines.
🦀Rust Lifetimes Explained
Stop reading `'a` as line noise and start reading it as scope arithmetic — one failing snippet at a time — until you can thread lifetimes through a small parser or iterator adapter without fighting the borrow checker.
☸️Kubernetes Core Concepts
Stop drowning in 30+ resource types. Build the mental model one primitive at a time -- pods, deployments, services, ingress, config -- then deploy a real app with rolling updates and health checks.
📈Big O Intuition
Stop treating Big O as math you memorized for an interview — build the intuition to spot O(n²) disasters, pick the right data structure without thinking, and rewrite a slow function from O(n²) to O(n) in under five minutes.