
📡 Stream LLM Responses for Snappy UX

Stop shipping six-second blank screens — switch to SSE streaming and watch perceived latency collapse from seconds to milliseconds. By the end you'll add a stop button and graceful retry to a streamed chat without dropping tokens.

Applied · 14 drops · ~2-week path · 5–8 min/day · Technology

Phase 1 · Time-to-First-Token: The Latency Users Actually Feel

Why time-to-first-token is the latency users actually feel

4 drops
  1. Total latency lies — time-to-first-token is what users feel

    6 min

    Users don't experience response time as a single number. They experience two: how long until something appears, and how long until it's done. The first one decides whether they stay.

  2. SSE is just HTTP that doesn't close — that's the whole trick

    6 min

    Server-Sent Events isn't a new protocol. It's a regular HTTP response with a content type that tells the browser 'don't close this connection until I do.' (See the sketch after this list.)

  3. The model emits tokens, not characters — and that changes your UI

    6 min

    An LLM doesn't stream letters. It streams tokens, which are roughly 3-4 characters of text glued together. Your UI has to render half-words convincingly.

  4. Streaming is free perceived speed but costs you simplicity

    7 min

    Streaming gives you a 10x perceived-latency win for almost no infrastructure cost. The actual cost is that every error, retry, and partial-output case in your code now has to think about half-finished responses.
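
To ground the "HTTP that doesn't close" idea before Phase 2 builds the real endpoint, here is a minimal sketch of an SSE route as a Next.js route handler. The route path and the one-second tick payload are invented for illustration; the parts that matter are the `text/event-stream` content type, the `data: ...` framing, and the response staying open until the stream is closed.

```ts
// app/api/tick/route.ts (hypothetical path, for illustration only)
export async function GET() {
  const encoder = new TextEncoder();

  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      for (let i = 1; i <= 3; i++) {
        // Each SSE event is "data: <payload>" followed by a blank line.
        controller.enqueue(encoder.encode(`data: tick ${i}\n\n`));
        await new Promise((resolve) => setTimeout(resolve, 1000));
      }
      controller.close(); // closing the stream is what finally ends the HTTP response
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream", // tells the browser: keep this connection open
      "Cache-Control": "no-cache",
      Connection: "keep-alive",
    },
  });
}
```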

Phase 2 · Stream Through a Next.js SSE Endpoint

Stream OpenAI through a Next.js SSE endpoint and measure it

5 drops
  1. A Next.js streaming route is a ReadableStream, not a JSON response

    7 min

    You don't return data from a Next.js route handler when streaming. You return a Response wrapped around a ReadableStream that the runtime drains as you write to it. (See the route sketch after this list.)

  2. Pass `stream: true` and the SDK hands you an async iterator

    7 min

    The OpenAI SDK doesn't return a different object for streaming. It returns the same kind of object — an async iterable of chunks. `for await` is the only ceremony.

  3. Fetch + ReadableStream beats EventSource for LLM clients

    7 min

    EventSource is the textbook way to consume SSE — and the wrong way for LLM clients. It can't send POST bodies, can't set auth headers, and can't be aborted cleanly. (See the client sketch after this list.)

  4. Measure TTFT and inter-token latency, not 'response time'

    7 min

    If your telemetry has one number for LLM latency, you can't tell whether streaming is working. You need two: time-to-first-token, and median inter-token latency.

  5. The first chunk is empty — and that's not a bug

    7 min

    OpenAI's streaming API sends a 'role' chunk before any content arrives. If your code keys off the first chunk being meaningful, you'll think streaming is broken when it's working perfectly.
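
Two sketches tie this phase together. First, the server: a Next.js App Router route handler that asks the OpenAI Node SDK for `stream: true`, iterates the resulting async iterable with `for await`, and forwards each content delta into a `ReadableStream` returned as the response body. The route path and model name are placeholders, and for brevity it streams raw text rather than `data:`-framed SSE events.

```ts
// app/api/chat/route.ts (illustrative path; adjust to your project)
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function POST(req: Request) {
  const { messages } = await req.json();

  // stream: true makes the SDK resolve to an async iterable of chunks.
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini", // placeholder; any chat model works here
    messages,
    stream: true,
  });

  const encoder = new TextEncoder();
  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      for await (const chunk of completion) {
        // The first chunk usually carries only the role, so delta.content is undefined.
        const delta = chunk.choices[0]?.delta?.content ?? "";
        if (delta) controller.enqueue(encoder.encode(delta));
      }
      controller.close(); // ends the HTTP response
    },
  });

  // Return the stream, not a JSON payload: the runtime drains it as we write.
  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```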
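Second, the client: `fetch` plus the response body's reader instead of `EventSource`, so the request can carry a POST body and headers and can be aborted. It also records time-to-first-token and the gaps between chunks as a rough proxy for inter-token latency; note that network and runtime buffering mean client-side chunks don't map one-to-one to model tokens.

```ts
// Client-side consumer for the route above. Names and logging are illustrative.
async function streamChat(
  messages: { role: string; content: string }[],
  onToken: (text: string) => void,
) {
  const t0 = performance.now();
  let last = t0;
  let first = true;
  const gaps: number[] = [];

  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    const now = performance.now();
    if (first) {
      console.log(`TTFT: ${(now - t0).toFixed(0)} ms`); // the number users actually feel
      first = false;
    } else {
      gaps.push(now - last); // inter-chunk gap, a stand-in for inter-token latency
    }
    last = now;

    onToken(decoder.decode(value, { stream: true }));
  }

  gaps.sort((a, b) => a - b);
  const median = gaps[Math.floor(gaps.length / 2)] ?? 0;
  console.log(`median inter-chunk gap: ${median.toFixed(0)} ms`);
}
```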

Phase 3 · Buffering, Aborts, Tool Calls, Structured Output

Handle buffering, abort, tool calls, and structured output

4 drops
  1. Your stream works in dev and arrives all at once in production

    7 min

    A streaming endpoint that ships to a CDN-backed deployment without the right headers will be silently buffered into a single response — and the symptom is users complaining your 'streaming' app feels exactly like the old one.

  2. User clicks stop mid-response — and tokens keep arriving for two seconds

    7 min

    Calling `controller.abort()` on the client doesn't stop tokens. It only closes the connection. The server keeps generating, charging you for tokens nobody will see, until you also abort the upstream LLM call. (See the sketch after this list.)

  3. Tool-call streaming arrives in fragments — and JSON.parse will betray you

    8 min

    When the model streams a tool call, it streams the arguments as a JSON string, a fragment at a time. You can't `JSON.parse` mid-stream — you have to accumulate every fragment and parse only on completion. (See the sketch after this list.)

  4. Streaming and JSON-mode fight each other — pick a strategy

    7 min

    When the model is producing JSON via response_format or a Zod schema, partial output is invalid by definition. You can stream it, but you can't show it as JSON until the very end.
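
A sketch of how the first two drops in this phase might land in the route handler: forward the incoming request's abort signal to the upstream OpenAI call so a client-side stop actually stops generation, and send headers that ask intermediaries not to buffer the stream. Which headers matter depends on your proxy or CDN, and whether `req.signal` fires on client disconnect depends on your Next.js version and runtime, so treat both as things to verify rather than guarantees.

```ts
// app/api/chat/route.ts variant: same shape as the Phase 2 sketch, plus abort
// propagation and anti-buffering headers.
import OpenAI from "openai";

const openai = new OpenAI();

export async function POST(req: Request) {
  const { messages } = await req.json();

  // Passing the request's signal as a request option means that when the browser
  // aborts its fetch, the upstream request is aborted too, so generation (and
  // billing) stops instead of producing tokens nobody will see.
  const completion = await openai.chat.completions.create(
    { model: "gpt-4o-mini", messages, stream: true },
    { signal: req.signal },
  );

  const encoder = new TextEncoder();
  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      try {
        for await (const chunk of completion) {
          const delta = chunk.choices[0]?.delta?.content ?? "";
          if (delta) controller.enqueue(encoder.encode(delta));
        }
      } catch {
        // An abort surfaces here as an exception; the client is gone, nothing to send.
      }
      try {
        controller.close();
      } catch {
        // The stream may already have been cancelled by the consumer.
      }
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/plain; charset=utf-8",
      "Cache-Control": "no-cache, no-transform", // no-transform discourages intermediary buffering
      "X-Accel-Buffering": "no", // nginx-style proxies: do not buffer this response
    },
  });
}
```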
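And for tool calls, a sketch of accumulating argument fragments until the model signals completion, assuming the chunk shape of the OpenAI chat completions streaming API; `stream` here is the async iterable the earlier sketches get back from `create({ ..., stream: true })`.

```ts
import type { ChatCompletionChunk } from "openai/resources/chat/completions";

// Accumulate streamed tool-call argument fragments per call index, and parse
// only once the model reports finish_reason === "tool_calls".
async function collectToolCalls(stream: AsyncIterable<ChatCompletionChunk>) {
  const calls: Record<number, { name?: string; args: string }> = {};

  for await (const chunk of stream) {
    const choice = chunk.choices[0];

    for (const call of choice?.delta?.tool_calls ?? []) {
      const slot = (calls[call.index] ??= { args: "" });
      if (call.function?.name) slot.name = call.function.name;
      slot.args += call.function?.arguments ?? ""; // a fragment, not valid JSON on its own
    }

    if (choice?.finish_reason === "tool_calls") {
      // Only now is each accumulated string complete, parseable JSON.
      return Object.values(calls).map(({ name, args }) => ({
        name,
        arguments: JSON.parse(args),
      }));
    }
  }
  return [];
}
```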

Phase 4 · Ship Stop and Retry Without Dropping Tokens

Ship a stop button and retry that never drops tokens

1 drop
  1. Build the stop + retry chat that never drops a token

    8 min

    The capstone: wire the stop button, the upstream abort, and a retry path into one chat flow so tokens the user has already seen are never lost (sketched below).
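
A client-side sketch of the invariant this capstone is after, with the endpoint and message shape carried over from the earlier phases: everything the user has seen lives in one buffer that neither stop nor retry clears, and a retry hands the already-rendered text back to the server as context. That resume strategy (the `alreadyStreamed` field and the `render` helper) is an assumption for illustration, not the only way to do it.

```ts
// Stop + retry without dropping tokens: `soFar` is never reset, only appended to.
let soFar = "";
let controller: AbortController | null = null;

async function start(messages: { role: string; content: string }[]) {
  controller = new AbortController();

  try {
    const res = await fetch("/api/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // Hypothetical server contract: send what was already rendered so the
      // model can continue instead of regenerating from scratch.
      body: JSON.stringify({ messages, alreadyStreamed: soFar }),
      signal: controller.signal,
    });

    const reader = res.body!.getReader();
    const decoder = new TextDecoder();
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      soFar += decoder.decode(value, { stream: true });
      render(soFar); // hypothetical UI hook
    }
  } catch (err) {
    if ((err as Error).name !== "AbortError") throw err; // a stop is not an error
  }
}

function stop() {
  controller?.abort(); // tokens already in `soFar` stay on screen
}

function retry(messages: { role: string; content: string }[]) {
  stop();
  return start(messages); // resumes with everything streamed so far as context
}

declare function render(text: string): void; // provided by your UI layer
```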

Frequently asked questions

What is the difference between SSE and WebSockets for streaming LLMs?
This is covered in the “Stream LLM Responses for Snappy UX” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
Why does my streamed LLM response feel slow even when tokens arrive fast?
This is covered in the “Stream LLM Responses for Snappy UX” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
How do I cancel an in-flight LLM stream from the browser?
This is covered in the “Stream LLM Responses for Snappy UX” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
How do I stream structured output or tool calls without breaking JSON parsing?
This is covered in the “Stream LLM Responses for Snappy UX” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
What causes dropped tokens when retrying a streamed LLM response?
This is covered in the “Stream LLM Responses for Snappy UX” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.