
📡 Stream LLM Responses for Snappy UX

Stop shipping six-second blank screens — switch to SSE streaming and watch perceived latency collapse from seconds to milliseconds. By the end you'll add a stop button and graceful retry to a streamed chat without dropping tokens.

Applied · 14 drops · ~2-week path · 5–8 min/day · Technology

Phase 1 · Time-to-First-Token: The Latency Users Actually Feel

Why time-to-first-token is the latency users actually feel

4 drops
  1. Total latency lies — time-to-first-token is what users feel

    6 min

    Users don't experience response time as a single number. They experience two: how long until something appears, and how long until it's done. The first one decides whether they stay.

  2. SSE is just HTTP that doesn't close — that's the whole trick

    6 min

    Server-Sent Events isn't a new protocol. It's a regular HTTP response with a content type that tells the browser 'don't close this connection until I do.' (See the sketch after this list.)

  3. The model emits tokens, not characters — and that changes your UI

    6 min

    An LLM doesn't stream letters. It streams tokens, which are roughly 3-4 characters of text glued together. Your UI has to render half-words convincingly.

  4. Streaming is free perceived speed but costs you simplicity

    7 min

    Streaming gives you a 10x perceived-latency win for almost no infrastructure cost. The actual cost is that every error, retry, and partial-output case in your code now has to think about half-finished responses.
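
To ground the "HTTP that doesn't close" idea before Phase 2 builds the real endpoint, here is a minimal sketch of an SSE route as a Next.js route handler. The route path and the one-second tick payload are invented for illustration; the parts that matter are the `text/event-stream` content type, the `data: ...` framing, and the response staying open until the stream is closed.

```ts
// app/api/tick/route.ts (hypothetical path, for illustration only)
export async function GET() {
  const encoder = new TextEncoder();

  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      for (let i = 1; i <= 3; i++) {
        // Each SSE event is "data: <payload>" followed by a blank line.
        controller.enqueue(encoder.encode(`data: tick ${i}\n\n`));
        await new Promise((resolve) => setTimeout(resolve, 1000));
      }
      controller.close(); // closing the stream is what finally ends the HTTP response
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream", // tells the browser: keep this connection open
      "Cache-Control": "no-cache",
      Connection: "keep-alive",
    },
  });
}
```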

Phase 2 · Stream Through a Next.js SSE Endpoint

Stream OpenAI through a Next.js SSE endpoint and measure it

5 drops
  1. A Next.js streaming route is a ReadableStream, not a JSON response

    7 min

    You don't return data from a Next.js route handler when streaming. You return a Response wrapped around a ReadableStream that the runtime drains as you write to it. (See the route sketch after this list.)

  2. Pass `stream: true` and the SDK hands you an async iterator

    7 min

    The OpenAI SDK doesn't return a different object for streaming. It returns the same kind of object — an async iterable of chunks. `for await` is the only ceremony.

  3. Fetch + ReadableStream beats EventSource for LLM clients

    7 min

    EventSource is the textbook way to consume SSE — and the wrong way for LLM clients. It can't send POST bodies, can't set auth headers, and can't be aborted cleanly. (See the client sketch after this list.)

  4. Measure TTFT and inter-token latency, not 'response time'

    7 min

    If your telemetry has one number for LLM latency, you can't tell whether streaming is working. You need two: time-to-first-token, and median inter-token latency.

  5. The first chunk is empty — and that's not a bug

    7 min

    OpenAI's streaming API sends a 'role' chunk before any content arrives. If your code keys off the first chunk being meaningful, you'll think streaming is broken when it's working perfectly.
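
Two sketches tie this phase together. First, the server: a Next.js App Router route handler that asks the OpenAI Node SDK for `stream: true`, iterates the resulting async iterable with `for await`, and forwards each content delta into a `ReadableStream` returned as the response body. The route path and model name are placeholders, and for brevity it streams raw text rather than `data:`-framed SSE events.

```ts
// app/api/chat/route.ts (illustrative path; adjust to your project)
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function POST(req: Request) {
  const { messages } = await req.json();

  // stream: true makes the SDK resolve to an async iterable of chunks.
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini", // placeholder; any chat model works here
    messages,
    stream: true,
  });

  const encoder = new TextEncoder();
  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      for await (const chunk of completion) {
        // The first chunk usually carries only the role, so delta.content is undefined.
        const delta = chunk.choices[0]?.delta?.content ?? "";
        if (delta) controller.enqueue(encoder.encode(delta));
      }
      controller.close(); // ends the HTTP response
    },
  });

  // Return the stream, not a JSON payload: the runtime drains it as we write.
  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```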
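Second, the client: `fetch` plus the response body's reader instead of `EventSource`, so the request can carry a POST body and headers and can be aborted. It also records time-to-first-token and the gaps between chunks as a rough proxy for inter-token latency; note that network and runtime buffering mean client-side chunks don't map one-to-one to model tokens.

```ts
// Client-side consumer for the route above. Names and logging are illustrative.
async function streamChat(
  messages: { role: string; content: string }[],
  onToken: (text: string) => void,
) {
  const t0 = performance.now();
  let last = t0;
  let first = true;
  const gaps: number[] = [];

  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    const now = performance.now();
    if (first) {
      console.log(`TTFT: ${(now - t0).toFixed(0)} ms`); // the number users actually feel
      first = false;
    } else {
      gaps.push(now - last); // inter-chunk gap, a stand-in for inter-token latency
    }
    last = now;

    onToken(decoder.decode(value, { stream: true }));
  }

  gaps.sort((a, b) => a - b);
  const median = gaps[Math.floor(gaps.length / 2)] ?? 0;
  console.log(`median inter-chunk gap: ${median.toFixed(0)} ms`);
}
```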

Phase 3 · Buffering, Aborts, Tool Calls, Structured Output

Handle buffering, abort, tool calls, and structured output

4 drops
  1. Your stream works in dev and arrives all at once in production

    7 min

    A streaming endpoint that ships to a CDN-backed deployment without the right headers will be silently buffered into a single response — and the symptom is users complaining your 'streaming' app feels exactly like the old one.

  2. User clicks stop mid-response — and tokens keep arriving for two seconds

    7 min

    Calling `controller.abort()` on the client doesn't stop tokens. It only closes the connection. The server keeps generating, charging you for tokens nobody will see, until you also abort the upstream LLM call. (See the sketch after this list.)

  3. Tool-call streaming arrives in fragments — and JSON.parse will betray you

    8 min

    When the model streams a tool call, it streams the arguments as a JSON string, a fragment at a time. You can't `JSON.parse` mid-stream — you have to accumulate every fragment and parse only on completion. (See the sketch after this list.)

  4. Streaming and JSON-mode fight each other — pick a strategy

    7 min

    When the model is producing JSON via response_format or a Zod schema, partial output is invalid by definition. You can stream it, but you can't show it as JSON until the very end.
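
A sketch of how the first two drops in this phase might land in the route handler: forward the incoming request's abort signal to the upstream OpenAI call so a client-side stop actually stops generation, and send headers that ask intermediaries not to buffer the stream. Which headers matter depends on your proxy or CDN, and whether `req.signal` fires on client disconnect depends on your Next.js version and runtime, so treat both as things to verify rather than guarantees.

```ts
// app/api/chat/route.ts variant: same shape as the Phase 2 sketch, plus abort
// propagation and anti-buffering headers.
import OpenAI from "openai";

const openai = new OpenAI();

export async function POST(req: Request) {
  const { messages } = await req.json();

  // Passing the request's signal as a request option means that when the browser
  // aborts its fetch, the upstream request is aborted too, so generation (and
  // billing) stops instead of producing tokens nobody will see.
  const completion = await openai.chat.completions.create(
    { model: "gpt-4o-mini", messages, stream: true },
    { signal: req.signal },
  );

  const encoder = new TextEncoder();
  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      try {
        for await (const chunk of completion) {
          const delta = chunk.choices[0]?.delta?.content ?? "";
          if (delta) controller.enqueue(encoder.encode(delta));
        }
      } catch {
        // An abort surfaces here as an exception; the client is gone, nothing to send.
      }
      try {
        controller.close();
      } catch {
        // The stream may already have been cancelled by the consumer.
      }
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/plain; charset=utf-8",
      "Cache-Control": "no-cache, no-transform", // no-transform discourages intermediary buffering
      "X-Accel-Buffering": "no", // nginx-style proxies: do not buffer this response
    },
  });
}
```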
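And for tool calls, a sketch of accumulating argument fragments until the model signals completion, assuming the chunk shape of the OpenAI chat completions streaming API; `stream` here is the async iterable the earlier sketches get back from `create({ ..., stream: true })`.

```ts
import type { ChatCompletionChunk } from "openai/resources/chat/completions";

// Accumulate streamed tool-call argument fragments per call index, and parse
// only once the model reports finish_reason === "tool_calls".
async function collectToolCalls(stream: AsyncIterable<ChatCompletionChunk>) {
  const calls: Record<number, { name?: string; args: string }> = {};

  for await (const chunk of stream) {
    const choice = chunk.choices[0];

    for (const call of choice?.delta?.tool_calls ?? []) {
      const slot = (calls[call.index] ??= { args: "" });
      if (call.function?.name) slot.name = call.function.name;
      slot.args += call.function?.arguments ?? ""; // a fragment, not valid JSON on its own
    }

    if (choice?.finish_reason === "tool_calls") {
      // Only now is each accumulated string complete, parseable JSON.
      return Object.values(calls).map(({ name, args }) => ({
        name,
        arguments: JSON.parse(args),
      }));
    }
  }
  return [];
}
```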

Phase 4 · Ship Stop and Retry Without Dropping Tokens

Ship a stop button and retry that never drops tokens

1 drop
  1. Build the stop + retry chat that never drops a token

    8 min

    The capstone: wire the stop button, the upstream abort, and a retry path into one chat flow so tokens the user has already seen are never lost (sketched below).
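
A client-side sketch of the invariant this capstone is after, with the endpoint and message shape carried over from the earlier phases: everything the user has seen lives in one buffer that neither stop nor retry clears, and a retry hands the already-rendered text back to the server as context. That resume strategy (the `alreadyStreamed` field and the `render` helper) is an assumption for illustration, not the only way to do it.

```ts
// Stop + retry without dropping tokens: `soFar` is never reset, only appended to.
let soFar = "";
let controller: AbortController | null = null;

async function start(messages: { role: string; content: string }[]) {
  controller = new AbortController();

  try {
    const res = await fetch("/api/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // Hypothetical server contract: send what was already rendered so the
      // model can continue instead of regenerating from scratch.
      body: JSON.stringify({ messages, alreadyStreamed: soFar }),
      signal: controller.signal,
    });

    const reader = res.body!.getReader();
    const decoder = new TextDecoder();
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      soFar += decoder.decode(value, { stream: true });
      render(soFar); // hypothetical UI hook
    }
  } catch (err) {
    if ((err as Error).name !== "AbortError") throw err; // a stop is not an error
  }
}

function stop() {
  controller?.abort(); // tokens already in `soFar` stay on screen
}

function retry(messages: { role: string; content: string }[]) {
  stop();
  return start(messages); // resumes with everything streamed so far as context
}

declare function render(text: string): void; // provided by your UI layer
```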

Frequently asked questions

What is the difference between SSE and WebSockets for streaming LLMs?
This is covered in the “Stream LLM Responses for Snappy UX” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
Why does my streamed LLM response feel slow even when tokens arrive fast?
This is covered in the “Stream LLM Responses for Snappy UX” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
How do I cancel an in-flight LLM stream from the browser?
This is covered in the “Stream LLM Responses for Snappy UX” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
How do I stream structured output or tool calls without breaking JSON parsing?
This is covered in the “Stream LLM Responses for Snappy UX” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
What causes dropped tokens when retrying a streamed LLM response?
This is covered in the “Stream LLM Responses for Snappy UX” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.