
Building a Claude API Chatbot in Laravel: Complete Integration Guide With Streaming Responses

If you have ever shipped a Laravel AI feature and watched the UX collapse the first time a user asked a long question, you already know the problem. A blocking HTTP call to the Claude API on a long response can take ten, fifteen, sometimes twenty seconds to resolve. Your users don’t know the app is working. They refresh. They complain. They leave. This guide assumes you have a working Claude API connection — if you need the full integration foundation, start with the complete Laravel Claude API integration guide.

This guide solves that with a Laravel Claude streaming chatbot built on Server-Sent Events, a properly layered service architecture, and database-backed conversation memory. We are not building a demo. We are building something you can actually deploy. For conversation memory — persisting and replaying message history across sessions — the dedicated conversation memory guide continues from where this one ends.

Stack: Laravel 12, the Anthropic Messages API, SSE via response()->stream(), Eloquent for persistence, and the Service Container for wiring it cleanly.

Why Streaming Is Not Optional for AI Chat

Let us be direct about this. Without streaming, a Claude response that takes eight seconds to generate means eight seconds of a blank screen. Perceived performance dies. It does not matter that your server is healthy, your queue workers are running, and your database queries are fast. The user sees nothing and draws the worst conclusion.

Streaming fixes the perception problem by pushing tokens as they are generated. The first word appears in under a second on most responses. The user reads while the model writes. A failed mid-stream response is still partially useful — you get something instead of nothing, and you can handle the error gracefully on the frontend without making the entire interaction feel broken.

There is also a secondary benefit that most tutorials skip entirely: streaming makes backpressure visible. When you see tokens arriving slowly or stalling, you know something is wrong at the API level. With a blocking call, you just get a timeout and no context.

Architecture Overview

The request lifecycle looks like this:

  1. A validated POST hits ChatStreamController.
  2. The user message is persisted immediately to chat_messages.
  3. The full conversation history is retrieved, formatted, and passed to ClaudeService.
  4. ClaudeService opens a streaming HTTP connection to the Anthropic API.
  5. Each token delta is pushed to the client via SSE as it arrives.
  6. Once the stream closes, the complete assistant message is persisted.
  7. The frontend listens on an EventSource, appending tokens to the UI in real time.

No logic lives in the controller beyond orchestration. The controller does not know what Claude is, what SSE is, or how messages are formatted. That belongs to the service layer, and keeping it there is what makes this testable and maintainable.
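
Concretely, the SSE frames emitted in steps 5–6 look like this on the wire. The `{"text": ...}` payload shape is this guide's own convention, not an Anthropic format — our controller re-wraps the raw Anthropic deltas:

```
data: {"text":"Hello"}

data: {"text":" world"}

event: end
data: done
```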

Prerequisites and Configuration

First, wire up your Anthropic credentials. Add this to config/services.php:

'anthropic' => [
    'key' => env('ANTHROPIC_API_KEY'),
    'version' => '2023-06-01',
],

Then in your `.env`:

ANTHROPIC_API_KEY=sk-ant-your-key-here

Register the service in bootstrap/app.php (Laravel 11/12 — no separate AppServiceProvider file required unless you have already created one). Because we want a single shared instance, use withSingletons rather than withBindings:

use App\Services\ClaudeService;

->withSingletons([
    ClaudeService::class,
])

Or, if you prefer explicit binding in a service provider:

$this->app->singleton(ClaudeService::class, function () {
    return new ClaudeService();
});

The singleton scope matters here. You do not want a fresh HTTP client instantiated on every request.

Step 1: Database Schema for Conversation Memory

Schema::create('chat_messages', function (Blueprint $table) {
    $table->id();
    $table->uuid('conversation_id');
    $table->enum('role', ['user', 'assistant', 'system']);
    $table->longText('content');
    $table->unsignedInteger('token_count')->nullable();
    $table->timestamps();
    $table->index('conversation_id');
});

The token_count column is intentional: we will use it later to enforce memory pruning without re-counting tokens on every request. The conversation_id index is non-negotiable — once a conversation has fifty messages, an unindexed scan on that column will hurt you.

Use longText for content. text columns cap at 65 KB. Claude responses on complex prompts will exceed that.

Step 2: The ChatMessage Model

namespace App\Models;

use Illuminate\Database\Eloquent\Model;

class ChatMessage extends Model
{
    protected $fillable = [
        'conversation_id',
        'role',
        'content',
        'token_count',
    ];

    protected $casts = [
        'token_count' => 'integer',
    ];

    public function scopeForConversation($query, string $conversationId)
    {
        return $query->where('conversation_id', $conversationId)->orderBy('id');
    }
}

The scopeForConversation local scope keeps query logic out of the controller. You will use it in multiple places once you add conversation summarisation, so defining it once here pays off immediately.

Step 3: ClaudeService — Streaming With Correct SSE Parsing

This is where most implementations get it wrong. The Anthropic streaming API returns SSE-formatted text (data: {...}\n\n). When you iterate getBody() with a foreach, you get arbitrary binary chunks — a chunk might contain half a line, two lines, or three lines plus a partial fourth. You cannot json_decode those chunks directly. You need a line buffer.

namespace App\Services;

use Illuminate\Support\Facades\Http;
use Illuminate\Http\Client\RequestException;

class ClaudeService
{
    private string $apiKey;
    private string $apiVersion;

    public function __construct()
    {
        $this->apiKey = config('services.anthropic.key');
        $this->apiVersion = config('services.anthropic.version', '2023-06-01');
    }

    public function stream(array $messages, callable $onChunk): void
    {
        try {
            $response = Http::withHeaders([
                'x-api-key'         => $this->apiKey,
                'anthropic-version' => $this->apiVersion,
                'Content-Type'      => 'application/json',
            ])
            ->timeout(120)
            ->withOptions(['stream' => true])
            ->post('https://api.anthropic.com/v1/messages', [
                'model'      => 'claude-sonnet-4-6',
                'max_tokens' => 1024,
                'messages'   => $messages,
                'stream'     => true,
            ]);

            if ($response->failed()) {
                throw new \RuntimeException(
                    'Claude API request failed: ' . $response->status()
                );
            }

            $body   = $response->toPsrResponse()->getBody();
            $buffer = '';

            while (!$body->eof()) {
                $buffer .= $body->read(1024);
                $lines   = explode("\n", $buffer);
                $buffer  = array_pop($lines); // retain incomplete line

                foreach ($lines as $line) {
                    $line = trim($line);

                    if (!str_starts_with($line, 'data: ')) {
                        continue;
                    }

                    $json = substr($line, 6);

                    $decoded = json_decode($json, true);

                    if (json_last_error() !== JSON_ERROR_NONE) {
                        continue;
                    }

                    $type = $decoded['type'] ?? null;

                    // Anthropic ends the stream with a message_stop event;
                    // there is no OpenAI-style [DONE] sentinel.
                    if ($type === 'message_stop') {
                        return;
                    }

                    if ($type === 'error') {
                        throw new \RuntimeException(
                            'Claude stream error: ' . ($decoded['error']['message'] ?? 'unknown')
                        );
                    }

                    if ($type === 'content_block_delta') {
                        $onChunk($decoded['delta']['text'] ?? '');
                    }
                }
            }
        } catch (RequestException $e) {
            throw new \RuntimeException(
                'HTTP client error during Claude stream: ' . $e->getMessage(),
                $e->getCode(),
                $e
            );
        }
    }

    public function formatMessages($messages): array
    {
        return $messages->map(fn ($m) => [
            'role'    => $m->role,
            'content' => [
                ['type' => 'text', 'text' => $m->content],
            ],
        ])->toArray();
    }
}

[Production Pitfall] The timeout(120) call is not optional. Laravel’s HTTP client applies a default timeout of 30 seconds, which will cut off long Claude responses mid-stream. On complex prompts, 120 seconds is a reasonable ceiling. If you are on a queue worker rather than a web request, you can push this higher.
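
To see the reassembly technique above in isolation, here is the same line-buffer logic as a standalone sketch — JavaScript rather than PHP purely so it runs anywhere; createSseLineParser is an illustrative name, not part of the service:

```javascript
// Chunks arrive at arbitrary byte boundaries; only complete lines are parsed.
function createSseLineParser(onEvent) {
  let buffer = '';
  return function feed(chunk) {
    buffer += chunk;                 // accumulate raw bytes
    const lines = buffer.split('\n');
    buffer = lines.pop();            // retain the incomplete trailing line

    for (const raw of lines) {
      const line = raw.trim();
      if (!line.startsWith('data: ')) continue;  // skip event:/blank lines
      onEvent(JSON.parse(line.slice(6)));        // complete lines parse cleanly
    }
  };
}

// One SSE event split across three network chunks:
const events = [];
const feed = createSseLineParser((e) => events.push(e));
feed('data: {"type":"content_blo');     // half a line — buffered, not parsed
feed('ck_delta","delta":{"text":"Hi"}}');
feed('\n\n');                           // the newline completes the line
// events[0].delta.text === "Hi"
```

Nothing is parsed until a newline arrives — exactly the guarantee the PHP loop relies on.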

Step 4: The Streaming Controller

namespace App\Http\Controllers;

use App\Models\ChatMessage;
use App\Services\ClaudeService;
use Illuminate\Http\Request;

class ChatStreamController extends Controller
{
    public function stream(Request $request, ClaudeService $claude)
    {
        $validated = $request->validate([
            'conversation_id' => ['required', 'uuid'],
            'message'         => ['required', 'string', 'max:4000'],
        ]);

        ChatMessage::create([
            'conversation_id' => $validated['conversation_id'],
            'role'            => 'user',
            'content'         => $validated['message'],
        ]);

        $messages = ChatMessage::forConversation($validated['conversation_id'])->get();

        $claudeMessages = $claude->formatMessages(
            $this->pruneHistory($messages)
        );

        $conversationId = $validated['conversation_id'];

        return response()->stream(function () use ($claude, $claudeMessages, $conversationId) {
            $assistantText = '';

            try {
                $claude->stream($claudeMessages, function ($chunk) use (&$assistantText) {
                    $assistantText .= $chunk;
                    echo 'data: ' . json_encode(['text' => $chunk]) . "\n\n";
                    ob_flush();
                    flush();
                });
            } catch (\RuntimeException $e) {
                echo 'event: error' . "\n";
                echo 'data: ' . json_encode(['message' => $e->getMessage()]) . "\n\n";
                ob_flush();
                flush();
                return;
            }

            if (!empty($assistantText)) {
                ChatMessage::create([
                    'conversation_id' => $conversationId,
                    'role'            => 'assistant',
                    'content'         => $assistantText,
                ]);
            }

            echo "event: end\ndata: done\n\n";
            ob_flush();
            flush();

        }, 200, [
            'Content-Type'  => 'text/event-stream',
            'Cache-Control' => 'no-cache',
            'Connection'    => 'keep-alive',
            'X-Accel-Buffering' => 'no',
        ]);
    }

    private function pruneHistory($messages, int $limit = 20)
    {
        if ($messages->count() <= $limit) {
            return $messages;
        }

        $recent = $messages->slice(-$limit)->values();

        // The Messages API expects the first entry to be a user message;
        // drop a leading assistant message if the slice starts mid-exchange.
        while ($recent->isNotEmpty() && $recent->first()->role !== 'user') {
            $recent->shift();
        }

        return $recent->values();
    }
}

Note the X-Accel-Buffering: no header. If you are sitting behind Nginx (and you probably are in production), Nginx buffers proxy responses by default. Without this header, tokens accumulate in Nginx’s buffer and the client receives them in large batches — which completely defeats the point of streaming. This is the single most common reason streaming “works locally but not in production.”
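
The X-Accel-Buffering header handles this per response. If you also control the Nginx config, you can disable proxy buffering for the streaming route explicitly — a sketch assuming a reverse-proxy setup (the location path and app_upstream name are placeholders for your own config):

```nginx
location /chat/stream {
    proxy_buffering off;       # stream bytes to the client as they arrive
    proxy_cache off;
    proxy_read_timeout 150s;   # outlive the 120s upstream timeout
    proxy_pass http://app_upstream;
}
```

Either mechanism works; the response header is more portable because it travels with the application.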

Step 5: Routing

// routes/web.php
use App\Http\Controllers\ChatStreamController;

Route::post('/chat/stream', [ChatStreamController::class, 'stream'])
    ->middleware(['auth', 'throttle:60,1']);

Apply throttle middleware at the route level as a first line of defence. For production, you will want something more granular — per-user token tracking with a dedicated middleware layer. If you have not built that yet, the Laravel AI Middleware: Token Tracking & Rate Limiting guide on this site covers exactly that pattern, including tiered rate limiting per user plan.

Step 6: Frontend SSE Consumer

async function sendMessage(conversationId, message) {
    const url = new URL('/chat/stream', window.location.origin);

    const response = await fetch(url.toString(), {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
            'X-CSRF-TOKEN': document.querySelector('meta[name="csrf-token"]').content,
        },
        body: JSON.stringify({ conversation_id: conversationId, message }),
    });

    const reader    = response.body.getReader();
    const decoder   = new TextDecoder();
    const output    = document.querySelector('#output');
    let   buffer    = '';

    while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split('\n');
        buffer = lines.pop();

        for (const line of lines) {
            if (line.startsWith('data: ')) {
                const json = line.slice(6).trim();
                if (json === 'done') continue;
                try {
                    const parsed = JSON.parse(json);
                    if (parsed.text) output.textContent += parsed.text;
                    if (parsed.message) console.error('Stream error:', parsed.message);
                } catch (_) {}
            }

            if (line.startsWith('event: end')) {
                reader.cancel();
                return;
            }
        }
    }
}

We have moved away from EventSource here and are using fetch with a ReadableStream reader instead. The reason: EventSource only supports GET requests, which means you cannot include a CSRF token or a JSON body. For anything beyond a trivial prototype, use fetch with a stream reader.

Memory Management: The Part That Actually Keeps Costs Down

Naive implementations send the entire conversation history to Claude on every turn. That is fine for a five-message thread. It is catastrophic for a two-hundred-message one.

Token costs scale linearly with input size. A 200-message conversation, each averaging 150 tokens, adds 30,000 input tokens to every single API call. At current pricing, that adds up fast — and you hit Claude’s context window limit before you notice the bill.

A practical approach involves three layers:

Layer 1 — Recency window. Send only the last 20 messages verbatim. The pruneHistory() method in our controller handles this. Tune the limit based on your average message length.

Layer 2 — Summarisation. For conversations that exceed the recency window, generate a one-paragraph summary of older messages and pass it through the Messages API’s top-level system parameter (the messages array itself only accepts user and assistant roles). Claude handles factual compression well. The summary replaces what would have been fifty messages with two hundred tokens.

Layer 3 — Token estimation. Before every API call, estimate the total input token count (roughly four characters per token as a heuristic). If you are approaching your budget, either tighten the recency window or trigger a summarisation cycle proactively.
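
Layer 3 reduces to a few lines. A sketch of the heuristic — JavaScript for illustration; estimateTokens, withinBudget, and the four-characters-per-token ratio are rough assumptions, not an Anthropic API:

```javascript
// ~4 characters per token is a rough heuristic for English text.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// True if the history fits the input-token budget for the next call.
function withinBudget(messages, budgetTokens) {
  const total = messages.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  return total <= budgetTokens;
}

// The 200-message scenario from above: 600 chars ≈ 150 tokens per message.
const history = Array.from({ length: 200 }, () => ({ content: 'x'.repeat(600) }));
const sendFull = withinBudget(history, 20000);               // full history blows the budget
const sendRecent = withinBudget(history.slice(-20), 20000);  // last 20 messages fit easily
```

When the check fails, tighten the recency window or dispatch the summarisation job before the next turn.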

[Architect’s Note] Summarisation should not run synchronously in the request lifecycle. Dispatch it as a queued job when message count crosses a threshold. The current request uses the existing history; the next request benefits from the compacted summary. This is the pattern used in production systems that handle thousands of concurrent conversations — if you want to understand the broader service design decisions behind this kind of architecture, the Production-Grade AI Architecture in Laravel: Contracts, Governance & Telemetry guide covers the contract and governance layer that sits above what we are building here.

Handling Rate Limits From the Anthropic API

Anthropic enforces rate limits at the token-per-minute and requests-per-minute level. The Anthropic API rate limit documentation outlines the specifics by tier, but the key point for Laravel is: you need to handle 429 responses gracefully.

Add retry logic to your HTTP client configuration in ClaudeService:

$response = Http::withHeaders([...])
    ->timeout(120)
    ->retry(3, 2000, function ($exception, $request) {
        if ($exception instanceof \Illuminate\Http\Client\RequestException) {
            return $exception->response->status() === 429;
        }
        return false;
    })
    ->withOptions(['stream' => true])
    ->post('https://api.anthropic.com/v1/messages', [...]);

The retry(3, 2000) call attempts the request up to three times with a fixed 2000ms delay between attempts, but only on a 429 response. Do not retry on 400s — those are your bugs, not theirs. Do not retry on 500s unconditionally either; a 529 (overloaded) is worth a retry, but a standard 500 may indicate a malformed payload.
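
If you want exponential backoff rather than a fixed delay, the schedule is trivial to compute — a JavaScript sketch for illustration (backoffDelayMs is a hypothetical helper; recent Laravel versions also let you pass retry() an array of millisecond delays such as [2000, 4000, 8000]):

```javascript
// Delay doubles each attempt from the base; jitterFn is a hook for
// randomness and defaults to none so the schedule below is deterministic.
function backoffDelayMs(attempt, baseMs = 2000, jitterFn = () => 0) {
  return baseMs * 2 ** (attempt - 1) + jitterFn();
}

const schedule = [1, 2, 3].map((n) => backoffDelayMs(n));
// schedule → [2000, 4000, 8000]
```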

Testing the Service in Isolation

Because ClaudeService is registered in the Service Container, you can swap it for a fake in tests without touching the controller:

// In a Feature test
$this->instance(ClaudeService::class, new class {
    public function stream(array $messages, callable $onChunk): void
    {
        $onChunk('Hello ');
        $onChunk('from ');
        $onChunk('fake Claude.');
    }

    public function formatMessages($messages): array
    {
        return $messages->toArray();
    }
});

This is why the service layer exists. A controller that calls Http::post() directly is untestable without HTTP faking. A controller that depends on ClaudeService can be tested with a two-line anonymous class. The Laravel HTTP Client documentation covers Http::fake() as well if you prefer testing at the HTTP layer.

Common Failure Modes

| Failure | Root Cause | Fix |
| --- | --- | --- |
| Tokens arrive in bursts, not individually | Nginx buffering | Add `X-Accel-Buffering: no` header |
| Stream works locally, silently fails in production | PHP output buffering enabled | Confirm `ob_flush()` + `flush()` and check `output_buffering` in php.ini |
| JSON decode fails on chunks | Raw chunk iteration without line buffer | Use the `read()` + `eof()` loop shown above |
| Context grows unbounded | No memory pruning | Implement recency window + summarisation |
| 429 errors under load | No retry logic | Use `Http::retry()` with a 429-scoped condition |
| Lost assistant messages | Persisting before stream closes | Persist inside the stream callback, after the stream completes |
| CSRF token rejection on SSE | Using `EventSource` (GET only) | Switch to `fetch()` with `ReadableStream` as shown above |

When Streaming Is the Wrong Choice

Streaming is a UX decision for interactive, human-facing responses. It is not a default you apply universally.

Do not stream for batch jobs. If you are processing a thousand documents overnight through the Claude API, nobody is watching a cursor blink. Run those as queued jobs with Http::post() and move on. Way simpler, and you eliminate the complexity of managing open connections at scale.

Do not stream for very short responses. If your average response is two sentences, the time-to-first-token advantage is negligible. The added frontend and backend complexity is not worth it.

Do not stream if you need strict response validation before displaying anything. Streaming means you show text before you have seen all of it — if your downstream logic needs to validate or transform the full response first (say, parsing a JSON tool call result), buffer it internally and display it as a unit.
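
For that buffer-then-validate case the frontend change is small: collect every chunk, parse once at the end, and only render if the parse succeeds. A minimal sketch (createBufferedCollector is a hypothetical helper, not part of the code above):

```javascript
// Accumulates streamed chunks and validates the full payload at the end.
function createBufferedCollector() {
  let text = '';
  return {
    push(chunk) { text += chunk; },
    // Returns the parsed object, or null if the full text is not valid JSON.
    finish() {
      try { return JSON.parse(text); } catch (_) { return null; }
    },
  };
}

const collector = createBufferedCollector();
collector.push('{"tool":"search","que');   // mid-stream: not yet valid JSON
collector.push('ry":"laravel sse"}');
const result = collector.finish();
// result → { tool: "search", query: "laravel sse" }
```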

Streaming is powerful. It is also a pattern with a real implementation cost. Apply it where it demonstrably improves the user experience, and default to simpler async patterns everywhere else.

Comments

Kylan Gentry writes:

I’ve set up my ClaudeService to use the streaming option, and my controller uses response()->stream().
The issue is that when I try to parse the chunks coming from the Anthropic API, json_decode always returns null, and the frontend doesn’t show anything until the entire request is complete—completely defeating the purpose of streaming.

Here is my service implementation:

// app/Services/ClaudeService.php
public function stream(array $messages, callable $onChunk): void
{
    $response = Http::withHeaders([
        'x-api-key' => config('services.anthropic.key'),
        'anthropic-version' => '2023-06-01',
    ])->withOptions([
        'stream' => true,
    ])->post('https://api.anthropic.com/v1/messages', [
        'model' => 'claude-3-5-sonnet-20240620',
        'max_tokens' => 800,
        'messages' => $messages,
        'stream' => true,
    ]);

    foreach ($response->toPsrResponse()->getBody() as $chunk) {
        // This is where it fails. $chunk looks like "data: {"type": "content_block_delta", ...}"
        $decoded = json_decode($chunk, true); 

        if ($decoded && $decoded['type'] === 'content_block_delta') {
            $onChunk($decoded['delta']['text'] ?? '');
        }
    }
}

In my Controller, I’m using ob_flush() and flush(), but the browser still waits for the “end” event before rendering anything.

My Questions:

  1. Why is json_decode($chunk) failing? When I log the $chunk, it contains the string data: at the start. Should I be stripping that manually?
  2. Even if I fix the parsing, why does Nginx/Laravel seem to buffer the response instead of sending tokens to the browser immediately?