
🗺️ Choose an LLM Deployment Topology

Stop choosing between 'just call OpenAI' and 'self-host on H100s': there's a graduation ladder of four real LLM topologies connecting those extremes. By the end, you'll be able to sketch a 12-month plan that survives 10x traffic growth.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Why Topology Beats 'API vs Self-Host'

See the four topologies most teams skip past

4 drops
  1. 'Should we self-host?' is the wrong first question

    6 min

    The real question isn't API vs self-host — it's which endpoints have constraints (latency, cost, sovereignty) that a single-provider API cannot meet.

  2. Three constraints actually drive topology: latency, cost, sovereignty

    6 min

    Almost every topology choice boils down to which of three constraints binds hardest: p95 latency, $/1M tokens, or where the data must live.

  3. The four topologies in plain language

    6 min

    Single API → gateway → hybrid hot-path → full self-host is a graduation ladder, not a binary; each step adds one capability the previous lacks (a minimal sketch of the ladder follows this list).

  4. The three topology decisions teams regret most

    7 min

    Premature self-host, vendor lock-in via SDK leakage, and 'multi-region from day one' eat more engineering time than any actual scaling event.
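
A compact way to hold the ladder in your head is as an ordered set of rungs, each adding one capability. The sketch below is a minimal Python illustration under that reading; the capability notes are assumptions, not the course's canonical definitions.

```python
# A sketch of the four-topology graduation ladder. The capability notes
# are illustrative assumptions, not an official taxonomy.
from enum import IntEnum

class Topology(IntEnum):
    SINGLE_API = 1       # one provider, called directly from app code
    GATEWAY = 2          # adds: routing, fallback, unified logging and keys
    HYBRID_HOT_PATH = 3  # adds: one self-hosted route for the costliest endpoint
    FULL_SELF_HOST = 4   # adds: full control of latency, cost, and data locality

def next_rung(current: Topology) -> Topology | None:
    """Climb one rung at a time; each step should be triggered, not scheduled."""
    return Topology(current + 1) if current < Topology.FULL_SELF_HOST else None
```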

Phase 2: The Latency-Cost 2x2 That Picks Your Topology

Map your endpoints to a latency-cost 2x2

5 drops
  1. Catalog every LLM call site in your app

    7 min

    You can't choose a topology until you've inventoried every endpoint, its volume, and which constraint binds — and most teams have never done this exercise (a starter inventory is sketched after this list).

  2. Plot every endpoint on a latency-cost 2x2

    7 min

    Classify each endpoint as latency-sensitive (yes/no) and cost-sensitive (yes/no); each quadrant maps to a specific topology recommendation (the inventory sketch after this list includes a quadrant classifier).

  3. Calculate the break-even point for self-hosting

    7 min

    Self-hosting an open-weight model pays off when monthly API spend on a single endpoint exceeds roughly 3x the cost of one always-on GPU pod plus eval/ops time (a worked calculation follows this list).

  4. Measure p95 latency before you optimize it

    7 min

    Most teams optimize for the wrong latency number — they look at provider SLAs instead of measuring real end-to-end p95 from their own infrastructure (a measurement sketch follows this list).

  5. Catalog sovereignty constraints before they catalog you

    7 min

    Data residency and contractual restrictions on AI processing limit your topology options long before cost or latency does — and they're often invisible in engineering meetings.
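
To make the inventory and 2x2 exercises concrete, here is a minimal Python sketch. The field names, cutoffs, and the quadrant-to-topology mapping are illustrative assumptions; substitute your own thresholds and prices.

```python
# Sketch: inventory each LLM call site, then classify it on the latency-cost
# 2x2. Cutoffs and the quadrant -> topology mapping are placeholder assumptions.
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    monthly_tokens_m: float   # volume, in millions of tokens per month
    p95_budget_ms: int        # the latency the product can tolerate
    usd_per_m_tokens: float   # blended API price for this route

def classify(ep: Endpoint,
             latency_cutoff_ms: int = 1000,
             spend_cutoff_usd: float = 5000.0) -> str:
    latency_sensitive = ep.p95_budget_ms < latency_cutoff_ms
    monthly_spend = ep.monthly_tokens_m * ep.usd_per_m_tokens
    cost_sensitive = monthly_spend > spend_cutoff_usd
    if latency_sensitive and cost_sensitive:
        return "hybrid hot-path: this route may earn its own GPU"
    if latency_sensitive:
        return "gateway: route to the fastest provider or region"
    if cost_sensitive:
        return "gateway: route to a cheaper model or batch tier"
    return "single API: leave it alone"

inventory = [
    Endpoint("chat_reply", monthly_tokens_m=800, p95_budget_ms=800, usd_per_m_tokens=3.0),
    Endpoint("nightly_summaries", monthly_tokens_m=2000, p95_budget_ms=60_000, usd_per_m_tokens=0.5),
]
for ep in inventory:
    print(f"{ep.name}: {classify(ep)}")
```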
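
The break-even rule turns into simple arithmetic. One reading of "3x the cost of one always-on GPU pod plus eval/ops time" is sketched below; every dollar figure is a placeholder to replace with your own quotes.

```python
# Sketch of the self-hosting break-even rule. All figures are placeholder
# assumptions; plug in your own GPU quote and an honest ops estimate.
gpu_pod_usd_per_month = 2200.0     # one always-on GPU pod
eval_ops_usd_per_month = 1500.0    # engineer time: evals, upgrades, paging

# One reading of the rule: spend must beat 3x the fully loaded self-host cost.
break_even_usd = 3 * (gpu_pod_usd_per_month + eval_ops_usd_per_month)

endpoint_api_spend_usd = 14_000.0  # this endpoint's current monthly API bill
if endpoint_api_spend_usd > break_even_usd:
    print(f"Self-host candidate: ${endpoint_api_spend_usd:,.0f}/mo exceeds "
          f"the ${break_even_usd:,.0f}/mo break-even")
else:
    print("Stay on the API for now")
```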
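
And for the latency drop, a minimal sketch of measuring end-to-end p95 from your own infrastructure; `call_llm` is a placeholder for your real client call, including auth and retries.

```python
# Sketch: measure the p95 your users actually see, not the provider's SLA.
# `call_llm` is a placeholder for your actual provider client.
import time
import statistics

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up your real client here")

def measure_p95_ms(prompts: list[str]) -> float:
    latencies_ms = []
    for prompt in prompts:
        start = time.perf_counter()
        call_llm(prompt)  # time the whole round trip, retries included
        latencies_ms.append((time.perf_counter() - start) * 1000)
    # statistics.quantiles with n=20 yields 19 cut points; the last is p95
    return statistics.quantiles(latencies_ms, n=20)[-1]
```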

Phase 3: Layering Gateways, Fallback, and Self-Host

Layer gateways, fallback, and self-host where they pay

4 drops
  1. When an LLM gateway pays for itself

    7 min

    A gateway is worth adding the moment you have a second provider or a reason to route by request type — not before (a combined gateway-and-fallback sketch follows this list).

  2. Regional fallback: cheap insurance against provider outages

    7 min

    A second provider configured as fallback turns 4-hour outages into 30-second blips, and costs almost nothing until the day it saves you (see the sketch after this list).

  3. Hybrid hot-path self-host: when one route earns its own GPU

    8 min

    Self-hosting one high-volume route while keeping everything else on the API is the most cost-effective topology for apps in the $20K–$200K/month spend range (a routing sketch follows this list).

  4. Edge inference: where it pays and where it's premature

    7 min

    Edge inference shines for sub-200ms latency-critical use cases on small models, and is overkill for anything else — including most chat applications.
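
A gateway and a fallback chain are two views of the same routing table, so one sketch covers both drops. The provider functions are placeholders; a production gateway, bought or open source, would add logging, rate limits, and key management.

```python
# Sketch: gateway routing by request type, with an ordered fallback chain.
# `call_primary` / `call_secondary` are placeholders for real provider clients.
def call_primary(prompt: str) -> str:
    raise NotImplementedError

def call_secondary(prompt: str) -> str:
    raise NotImplementedError

ROUTES = {
    "chat": [call_primary, call_secondary],   # latency-sensitive: fast provider first
    "batch": [call_secondary, call_primary],  # cost-sensitive: cheap provider first
}

def gateway(request_type: str, prompt: str) -> str:
    last_err: Exception | None = None
    for provider in ROUTES[request_type]:
        try:
            return provider(prompt)  # on success the caller never sees the outage
        except Exception as err:     # in production, catch provider errors only
            last_err = err
    raise RuntimeError("all providers failed") from last_err
```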
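
The hybrid hot-path is one more row in that table: the single route that earns its GPU goes to a self-hosted model, and everything else stays on the API. The endpoint name and clients below are illustrative assumptions.

```python
# Sketch: hybrid hot-path routing. One high-volume endpoint is served by a
# self-hosted open-weight model; all other traffic stays on the managed API.
def call_self_hosted(prompt: str) -> str:
    raise NotImplementedError("e.g. an OpenAI-compatible server you operate")

def call_api(prompt: str) -> str:
    raise NotImplementedError("managed provider client")

HOT_PATH_ENDPOINTS = {"nightly_summaries"}  # the route(s) that earn a GPU

def route(endpoint: str, prompt: str) -> str:
    if endpoint in HOT_PATH_ENDPOINTS:
        return call_self_hosted(prompt)
    return call_api(prompt)
```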

Phase 4: Sketching a 12-Month Plan for 10x Growth

Sketch a 12-month plan for 10x traffic growth

1 drop
  1. Sketch a 12-month topology plan for 10x traffic growth

    10 min

    A good 12-month topology plan names where you start, what triggers each step up the ladder, and what you explicitly choose NOT to build (a trigger-driven sketch follows below).
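
One way to write such a plan down is as trigger/action pairs plus an explicit not-building list. Every trigger and item below is a placeholder assumption to replace with your own numbers.

```python
# Sketch: a trigger-driven 12-month topology plan. Each step up the ladder
# fires on a measured condition, not a calendar date.
PLAN = [
    ("start",           "single API behind a thin in-house wrapper (no SDK leakage)"),
    ("add gateway",     "trigger: a second provider, or routing by request type"),
    ("add fallback",    "trigger: the first provider outage that pages a human"),
    ("hybrid hot-path", "trigger: one endpoint's API spend passes its break-even"),
    ("full self-host",  "trigger: a sovereignty contract, or sustained multi-route spend"),
]
NOT_BUILDING = [
    "multi-region active-active from day one",
    "a custom inference runtime",
]
for step, condition in PLAN:
    print(f"{step:>15}: {condition}")
```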

Frequently asked questions

When does self-hosting an open-weight LLM actually beat the API?
This is covered in the “Choose an LLM Deployment Topology” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
What is an LLM gateway and when do I need one?
This is covered in the “Choose an LLM Deployment Topology” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
How do I plan for 10x traffic growth without re-architecting?
This is covered in the “Choose an LLM Deployment Topology” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
Is multi-provider routing worth the operational complexity?
This is covered in the “Choose an LLM Deployment Topology” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
Where does edge inference fit in an LLM deployment plan?
This is covered in the “Choose an LLM Deployment Topology” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.