Laravel Claude API integration is where most tutorials stop at “it works” and ship. Then you add a second feature, traffic doubles, and you discover the first call was the easy part. Costs drift upward silently. A 429 at 2 AM takes down a queue. Streaming breaks behind your load balancer. Your tests pass while your prompts quietly regress.
This guide doesn’t do that. We build a Claude integration in Laravel 11/12 that handles real load—structured service layers, proper token accounting, streaming that actually works, background job processing, and tests that validate your assumptions about Claude rather than a fantasy mock. The contract-based service architecture that governs provider abstraction and AI governance at scale is covered in the Production-Grade AI Architecture in Laravel guide.
Stack: Laravel 11/12, Anthropic Messages API, raw HTTP. No SDK.
Foundation
Configuration
Claude uses API key authentication. Simple. No token refresh, no OAuth dance. That simplicity still requires discipline.
# .env
CLAUDE_API_KEY=sk-ant-...
CLAUDE_MODEL=claude-sonnet-4-6
CLAUDE_API_VERSION=2023-06-01
CLAUDE_MAX_TOKENS=1024
CLAUDE_TIMEOUT=30
// config/claude.php
return [
'api_key' => env('CLAUDE_API_KEY'),
'model' => env('CLAUDE_MODEL', 'claude-sonnet-4-6'),
'version' => env('CLAUDE_API_VERSION', '2023-06-01'),
'base_url' => 'https://api.anthropic.com/v1/messages',
'max_tokens' => (int) env('CLAUDE_MAX_TOKENS', 1024),
'timeout' => (int) env('CLAUDE_TIMEOUT', 30),
];
Never hardcode the model string in application code. Anthropic retires model names more frequently than you’d expect — claude-sonnet-4-6 is already retired as of early 2026. Driving the model from config means a one-line .env change at deploy time. Check Anthropic’s models reference before every significant deploy.
A Typed Response DTO
Before any service class: give Claude’s response a shape your application can depend on.
// app/Services/Claude/ClaudeResponse.php
namespace App\Services\Claude;
final readonly class ClaudeResponse
{
public function __construct(
public string $text,
public int $inputTokens,
public int $outputTokens,
public string $model,
) {}
public function totalTokens(): int
{
return $this->inputTokens + $this->outputTokens;
}
}
No arrays. No $response['content'][0]['text'] scattered across your codebase. One type. One parsing location.
The Claude Client
This class speaks Claude’s protocol. It knows nothing about your domain.
// app/Services/Claude/ClaudeClient.php
namespace App\Services\Claude;
use Illuminate\Http\Client\RequestException;
use Illuminate\Support\Facades\Http;
use RuntimeException;
final class ClaudeClient
{
public function message(array $payload): ClaudeResponse
{
return retry(
times: 3,
callback: fn () => $this->send($payload),
sleepMilliseconds: fn (int $attempt) => 500 * 2 ** ($attempt - 1), // 500ms, 1s, 2s
when: fn (\Throwable $e) => $this->isRetryable($e),
);
}
private function send(array $payload): ClaudeResponse
{
$response = Http::withHeaders([
'x-api-key' => config('claude.api_key'),
'anthropic-version' => config('claude.version'),
'content-type' => 'application/json',
])
->timeout(config('claude.timeout'))
->post(config('claude.base_url'), array_merge([
'model' => config('claude.model'),
'max_tokens' => config('claude.max_tokens'),
], $payload));
if ($response->status() === 429) {
throw new RuntimeException('Claude rate limit exceeded', 429);
}
$response->throw();
return $this->mapResponse($response->json());
}
private function mapResponse(array $data): ClaudeResponse
{
return new ClaudeResponse(
text: $data['content'][0]['text'] ?? '',
inputTokens: $data['usage']['input_tokens'] ?? 0,
outputTokens: $data['usage']['output_tokens'] ?? 0,
model: $data['model'] ?? config('claude.model'),
);
}
private function isRetryable(\Throwable $e): bool
{
if ($e instanceof RequestException) {
return in_array($e->response->status(), [429, 529]);
}
return $e->getCode() === 429;
}
}
The retry lambda targets 429 and 529 specifically. You don’t want to retry a 400 validation error three times — that wastes tokens and time. Exponential backoff is intentional: Anthropic’s rate limiters recover faster when you give them breathing room.
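At higher concurrency, a fixed schedule means every worker that hit the same 429 retries at the same instant. A minimal sketch of an exponential schedule with full jitter — the function name and the cap are assumptions, not part of the client above:

```php
// Hypothetical helper: exponential backoff with full jitter, capped.
// Attempt 1 waits up to 500ms, attempt 2 up to 1s, attempt 3 up to 2s, etc.
function backoffDelayMs(int $attempt, int $baseMs = 500, int $capMs = 8000): int
{
    // Double the window for each attempt after the first.
    $exponential = $baseMs * (2 ** ($attempt - 1));

    // Full jitter: randomise across the whole window so concurrent
    // workers don't retry in lockstep and re-trigger the rate limiter.
    return random_int(0, min($exponential, $capMs));
}
```

If you want jitter, pass this as the sleepMilliseconds callback; the deterministic schedule is fine at low concurrency.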
Register it as a singleton in AppServiceProvider::register() — bindings belong in register(), not boot(), and certainly not in RouteServiceProvider, which no longer exists in Laravel 11:
$this->app->singleton(ClaudeClient::class);
Core Integration Patterns
Pattern 1: Simple Request–Response
Don’t call ClaudeClient from controllers. Wrap each use case in a domain-specific service:
// app/Services/Claude/ArticleSummarizer.php
final class ArticleSummarizer
{
public function __construct(
private readonly ClaudeClient $claude
) {}
public function summarize(string $article): array
{
$response = $this->claude->message([
'system' => 'You are a precise summarisation assistant. Return factual, concise summaries only.',
'messages' => [
['role' => 'user', 'content' => "Summarise this article:\n\n{$article}"],
],
'max_tokens' => 400,
]);
return [
'summary' => $response->text,
'tokens' => [
'input' => $response->inputTokens,
'output' => $response->outputTokens,
],
];
}
}
The controller stays thin:
public function store(Request $request, ArticleSummarizer $summarizer): JsonResponse
{
try {
$result = $summarizer->summarize($request->input('content'));
return response()->json($result);
} catch (\Throwable $e) {
report($e);
return response()->json(['error' => 'AI service temporarily unavailable'], 503);
}
}
Controllers orchestrate. Services define how Claude is used. The client handles transport. Keep these boundaries clean and you’ll never have to grep your codebase to find where api_key is used.
Pattern 2: Streaming Responses
The article you’ve probably read implements streaming using ->onChunk(). That method doesn’t exist on Laravel’s HTTP client. Here’s what actually works:
// app/Services/Claude/ClaudeClient.php
public function stream(array $payload, callable $onChunk): void
{
$response = $this->buildRequest()
->withOptions([
'stream' => true,
'connect_timeout' => 5,
])
->post(config('claude.base_url'), array_merge([
'model' => config('claude.model'),
'max_tokens' => config('claude.max_tokens'),
'stream' => true,
], $payload));
$body = $response->toPsrResponse()->getBody();
while (! $body->eof()) {
$chunk = $body->read(1024);
if ($chunk !== '') {
$onChunk($chunk);
}
}
}
private function buildRequest(): \Illuminate\Http\Client\PendingRequest
{
return Http::withHeaders([
'x-api-key' => config('claude.api_key'),
'anthropic-version' => config('claude.version'),
'content-type' => 'application/json',
])->timeout(config('claude.timeout'));
}
The controller:
public function generate(Request $request): StreamedResponse
{
return response()->stream(function () use ($request) {
$this->claude->stream([
'messages' => [
['role' => 'user', 'content' => $request->input('prompt')],
],
], function (string $chunk) {
echo $chunk;
flush();
});
}, 200, [
'Content-Type' => 'text/event-stream',
'Cache-Control' => 'no-cache',
'X-Accel-Buffering' => 'no',
]);
}
The X-Accel-Buffering: no header is non-negotiable if you’re behind Nginx. Without it, your entire stream buffers server-side and the user sees nothing until completion. Verify your reverse proxy config before shipping any streaming endpoint.
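One caveat worth knowing: the raw chunks read from the PSR body are Anthropic's server-sent-event frames, not bare text. If you want to forward only the generated text, you need to parse the data: lines and pick out the content_block_delta events. A minimal sketch — it assumes the buffer contains complete frames; stitching frames that split across chunk boundaries is left to the caller:

```php
// Sketch: extract generated text from a buffer of complete SSE frames.
// Anthropic's streaming API emits typed events; the generated text lives
// in content_block_delta events under delta.text.
function extractTextFromSseFrames(string $buffer): string
{
    $text = '';

    foreach (explode("\n", $buffer) as $line) {
        if (! str_starts_with($line, 'data: ')) {
            continue; // skip "event:" lines and blank frame separators
        }

        $event = json_decode(substr($line, strlen('data: ')), true);

        if (($event['type'] ?? null) === 'content_block_delta') {
            $text .= $event['delta']['text'] ?? '';
        }
    }

    return $text;
}
```

Echoing the raw frames straight through, as the controller above does, also works when the browser speaks EventSource — this helper is for when you want plain text on the server side.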
The complete working chatbot implementation — conversation memory, real-time token rendering, and a Livewire frontend — lives in Building a Claude Chatbot with Streaming in Laravel. This section is the architectural foundation for that one.
Pattern 3: Conversation State
Claude is stateless. Full stop. Conversation memory is always an application concern, never Claude’s.
final class ConversationResponder
{
public function __construct(
private readonly ClaudeClient $claude
) {}
public function respond(array $history, string $input): ClaudeResponse
{
return $this->claude->message([
'messages' => [
...$history,
['role' => 'user', 'content' => $input],
],
'max_tokens' => 600,
]);
}
}
Where history comes from is a strategic choice: session storage for ephemeral chats, Eloquent for persistent conversations with audit trails, Redis for fast transient workflows. The client doesn’t care. Truncate aggressively — uncapped history is a token-cost time bomb.
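Truncation can be as simple as keeping the most recent turns while preserving the user/assistant alternation the API expects. A sketch — the helper name and the message budget are assumptions:

```php
// Sketch: keep at most $maxMessages of the most recent history entries,
// then drop any leading assistant messages left over from the cut so the
// window still starts with a user turn.
function truncateHistory(array $history, int $maxMessages = 20): array
{
    $recent = array_slice($history, -$maxMessages);

    while ($recent !== [] && $recent[0]['role'] !== 'user') {
        array_shift($recent);
    }

    return $recent;
}
```

For long-lived conversations, the usual next step is summarising the dropped turns into a single context message rather than discarding them outright.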
[Production Pitfall] A system role message does not belong inside the messages array. The Anthropic API will return a 400. System instructions go in a top-level system key alongside messages. If you’re copying the multi-turn example from the original article, you have a bug.
Pattern 4: Background Processing
Document ingestion, batch analysis, email drafts — anything the user isn’t waiting on goes into a queue.
final class AnalyzeDocument implements ShouldQueue
{
use Dispatchable, Queueable, InteractsWithQueue;
public int $tries = 3;
public int $timeout = 120; // exceed worst-case Claude latency
public function __construct(
private readonly int $documentId
) {}
public function handle(ClaudeClient $claude): void
{
$document = Document::findOrFail($this->documentId);
$response = $claude->message([
'system' => 'Extract structured data in JSON format only.',
'messages' => [
['role' => 'user', 'content' => "Extract from this document:\n\n{$document->content}"],
],
'max_tokens' => 1000,
]);
$document->update([
'analysis' => $response->text,
'tokens_used' => $response->totalTokens(),
]);
}
public function failed(\Throwable $exception): void
{
report($exception);
Document::find($this->documentId)?->markAnalysisFailed();
}
}
AI workers should run on an isolated queue with a worker --timeout that exceeds max_tokens worst-case latency. Sharing a queue with fast database jobs means one slow Claude request blocks everything behind it.
Token Accounting
Tokens are cost. If you don’t measure them at the service boundary, you’ll measure them on your invoice.
Logging Usage
Schema::create('ai_usage_logs', function (Blueprint $table) {
$table->id();
$table->string('feature');
$table->string('model');
$table->unsignedInteger('input_tokens');
$table->unsignedInteger('output_tokens');
$table->nullableMorphs('subject'); // link to user, document, etc.
$table->timestamps();
});
Log at the service boundary, not in the controller:
AiUsageLog::create([
'feature' => 'article_summary',
'model' => $response->model,
'input_tokens' => $response->inputTokens,
'output_tokens' => $response->outputTokens,
'subject_type' => User::class,
'subject_id' => auth()->id(),
]);
This gives you cost per feature, cost per user, and early warning when a prompt change sends token usage up 40% overnight.
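Turning those logged counts into money is one multiplication away. A sketch — the rates below are placeholders, not Anthropic's actual pricing; keep real per-model rates in config and refresh them from the pricing page:

```php
// Sketch: approximate dollar cost from token counts. $ratesPerMillion
// holds USD per million tokens; the values used in tests below are
// PLACEHOLDERS, not real Anthropic pricing.
function estimateCostUsd(int $inputTokens, int $outputTokens, array $ratesPerMillion): float
{
    return ($inputTokens / 1_000_000) * $ratesPerMillion['input']
         + ($outputTokens / 1_000_000) * $ratesPerMillion['output'];
}
```

Run this over ai_usage_logs grouped by feature and you have a per-feature cost report without touching the invoice.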
Pre-flight Estimation
You can’t know output tokens ahead of time. Input tokens are deterministic:
final class TokenEstimator
{
// ~4 chars per token is a reliable heuristic for English prose
public static function estimateInput(string $text): int
{
return (int) ceil(mb_strlen($text) / 4);
}
}
Gate expensive calls before they happen:
$estimated = TokenEstimator::estimateInput($document->content);
if ($estimated > 150_000) {
throw new DomainException('Document exceeds single-request token budget. Chunk it first.');
}
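When a document blows the budget, the usual answer is chunking. A minimal sketch that splits on paragraph breaks and greedily packs them under a per-chunk estimate, reusing the 4-chars-per-token heuristic from above (the function name and default budget are assumptions):

```php
// Sketch: greedily pack paragraphs into chunks whose estimated token
// count stays under $maxTokens, using the ~4 chars/token heuristic.
// A single paragraph larger than the budget becomes its own chunk.
function chunkByEstimatedTokens(string $text, int $maxTokens = 10_000): array
{
    $chunks = [];
    $current = '';

    foreach (preg_split('/\n{2,}/', $text) as $paragraph) {
        $candidate = $current === '' ? $paragraph : $current . "\n\n" . $paragraph;

        if (ceil(mb_strlen($candidate) / 4) > $maxTokens && $current !== '') {
            $chunks[] = $current;      // close the current chunk
            $current = $paragraph;     // start a new one
        } else {
            $current = $candidate;
        }
    }

    if ($current !== '') {
        $chunks[] = $current;
    }

    return $chunks;
}
```

Summarise each chunk in its own queued job, then synthesise the partial summaries in a final pass.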
The middleware layer that enforces per-user token budgets and throttle policies sits one level above this — Laravel AI Middleware: Token Tracking & Rate Limiting covers that full stack, and it pairs directly with the ClaudeResponse DTO we introduced above.
Document Q&A
A complete workflow: upload a document, pre-process it, support follow-up questions.
Step 1: Storage (No AI on Upload)
Schema::create('documents', function (Blueprint $table) {
$table->id();
$table->string('title');
$table->longText('content');
$table->text('summary')->nullable();
$table->enum('status', ['pending', 'ready', 'failed'])->default('pending');
$table->timestamps();
});
Don’t involve Claude on upload. Upload failures and AI failures are different failure modes. Keep them separate. AI processing is deferred to a job.
Step 2: Pre-Summarisation Job
final class PrepareDocumentSummary implements ShouldQueue
{
use Dispatchable, Queueable;
public int $tries = 2;
public function __construct(private readonly int $documentId) {}
public function handle(ClaudeClient $claude): void
{
$document = Document::findOrFail($this->documentId);
$response = $claude->message([
'system' => 'Create a concise, factual summary. Preserve key claims, figures, and named entities.',
'messages' => [
['role' => 'user', 'content' => $document->content],
],
'max_tokens' => 800,
]);
$document->update([
'summary' => $response->text,
'status' => 'ready',
]);
}

public function failed(\Throwable $exception): void
{
report($exception);
// Surface the failure so the UI doesn't show 'pending' forever.
Document::find($this->documentId)?->update(['status' => 'failed']);
}
}
Step 3: Q&A Against the Summary
Don’t resend the full document on every question. Use the summary:
final class DocumentQuestionAnswerer
{
public function __construct(
private readonly ClaudeClient $claude
) {}
public function answer(Document $document, string $question, array $history = []): string
{
$messages = [
...$history,
['role' => 'user', 'content' => $question],
];
$response = $this->claude->message([
'system' => "You answer questions based solely on this document summary:\n\n{$document->summary}",
'messages' => $messages,
'max_tokens' => 500,
]);
return $response->text;
}
}
This is intentionally conservative. It reduces hallucination risk and creates a clear upgrade path to RAG. When the summary stops being sufficient, you add chunked retrieval — you don’t rewrite the service.
Production Concerns
Rate Limiting
Rate limits aren’t edge cases. In production, they’re a steady-state constraint. Treat them as backpressure, not exceptions.
The retry logic in ClaudeClient handles transient 429s. That’s your last line of defence. The first line is shaping traffic before it reaches Claude:
// app/Providers/AppServiceProvider.php
RateLimiter::for('claude', function (Request $request) {
return [
Limit::perMinute(30)->by($request->user()?->id ?? $request->ip()),
Limit::perDay(500)->by($request->user()?->id ?? $request->ip()),
];
});
Route::post('/ai/generate', [AiController::class, 'generate'])
->middleware(['auth', 'throttle:claude']);
If you’re already hitting Anthropic’s tier limits under load, How to Handle Claude API Rate Limits in Production goes well beyond basic retry logic — backpressure handling, queue-based throttling, and what to monitor before a 429 cascade becomes an outage.
Caching
Caching LLM responses is primarily a stability mechanism. The correctness question is: what makes two requests “the same”? Get this wrong and you serve stale output without knowing it.
$cacheKey = 'claude:v2:' . hash('sha256', json_encode([
'model' => config('claude.model'),
'system' => $systemPrompt,
'messages' => $messages,
]));
return Cache::remember($cacheKey, now()->addHours(6), function () use ($payload, $cacheKey) {
$response = $this->claude->message($payload);
// Fallback persistence — if Redis is flushed, don't re-pay
StoredAiResponse::updateOrCreate(
['cache_key' => $cacheKey],
['response_text' => $response->text]
);
return $response;
});
Include the model name in the key. A model upgrade otherwise contaminates cached responses from the previous version with no warning. When prompts change significantly, bump the version prefix (v2 → v3) and accept the temporary re-generation cost.
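That key discipline is worth extracting into one function so every cached call shares the same definition of "the same request". A sketch, with an assumed function name:

```php
// Sketch: one function owns the cache-identity question. The model and a
// manual version prefix are part of the key; bump the prefix when prompts
// change meaningfully.
function claudeCacheKey(string $version, string $model, string $system, array $messages): string
{
    return 'claude:' . $version . ':' . hash('sha256', json_encode([
        'model' => $model,
        'system' => $system,
        'messages' => $messages,
    ]));
}
```

Now a model upgrade or a version bump invalidates the cache everywhere at once, rather than wherever someone remembered to update the key.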
[Efficiency Gain] For summaries and classification results, TTL should be measured in days, not hours. These are expensive to generate and change only when the source content changes. If you find yourself setting a 6-hour TTL on a document summary, you’re paying for the same generation four times per day.
Testing
Layer 1: Contract Tests
Test the protocol boundary. Headers, payload shape, 429 handling, response parsing:
Http::fake(function (\Illuminate\Http\Client\Request $request) {
expect($request->header('x-api-key'))->not->toBeNull();
expect($request->data()['model'])->toBe(config('claude.model'));
expect($request->data()['messages'])->toBeArray();
return Http::response([
'content' => [['type' => 'text', 'text' => 'OK']],
'usage' => ['input_tokens' => 10, 'output_tokens' => 5],
'model' => config('claude.model'),
]);
});
This fails loudly when headers disappear during a refactor or when a model name change doesn’t propagate through the config.
Layer 2: Prompt Construction
Prompt logic is application logic. Test it:
$result = PromptBuilder::for('document_qna')
->withSystem('Answer from the document only.')
->withContext('Document content...')
->withUser('What is the main argument?')
->build();
expect($result['messages'])->toMatchSnapshot();
Snapshot tests work well here. If the snapshot changes, someone must approve it. Prompt drift is real — without tests at this layer, you discover it when costs spike or output quality degrades.
Layer 3: Recorded Responses
Don’t hand-write fixture JSON. Record real Claude responses once, replay them:
Http::fake([
'api.anthropic.com/*' => Http::response(
json_decode(file_get_contents(base_path('tests/fixtures/claude/document_qna.json')), true)
),
]);
When prompts change meaningfully, update the fixture deliberately and document why. The fixture is your prompt regression baseline.
Layer 4: Behavioural Assertions
Never assert exact output text. Assert properties:
$result = $service->summarize($document);
expect($result['summary'])->toBeString();
expect(strlen($result['summary']))->toBeGreaterThan(200);
expect($result['tokens']['input'])->toBeGreaterThan(0);
This validates what the user cares about, not which tokens Claude happened to emit.
Advanced Patterns
Livewire Integration
Livewire’s strength is incremental server-driven updates, not raw HTTP streams. Be aware that dispatching $refresh from inside a running action does nothing until the request finishes. Livewire 3’s wire:stream exists for exactly this case: it pushes partial content to the browser while the action is still executing:
class AiAssistant extends Component
{
public string $prompt = '';
public string $response = '';
public bool $loading = false;
public function submit(ClaudeClient $claude): void
{
$this->loading = true;
$this->response = '';
$claude->stream(
['messages' => [['role' => 'user', 'content' => $this->prompt]]],
function (string $chunk) {
$this->response .= $chunk;
// wire:stream sends the chunk to the browser immediately;
// a dispatched $refresh would only render after completion.
$this->stream(to: 'response', content: $chunk, replace: false);
}
);
$this->loading = false;
}
}
In the Blade view, the element that displays the output carries wire:stream="response".
ClaudeClient knows nothing about Livewire. The component knows nothing about Claude’s API. You can replace either without touching the other.
Inertia + Vue (SSE)
With a stateful frontend, expose streaming via Server-Sent Events:
public function stream(Request $request, ClaudeClient $claude): StreamedResponse
{
return response()->stream(function () use ($request, $claude) {
$claude->stream(
['messages' => [['role' => 'user', 'content' => $request->input('prompt')]]],
function (string $chunk) {
echo "data: {$chunk}\n\n";
// ob_flush() raises a notice when no output buffer is active
if (ob_get_level() > 0) {
ob_flush();
}
flush();
}
);
echo "data: [DONE]\n\n";
if (ob_get_level() > 0) {
ob_flush();
}
flush();
}, 200, [
'Content-Type' => 'text/event-stream',
'Cache-Control' => 'no-cache',
'X-Accel-Buffering' => 'no',
]);
}
From Vue’s perspective, this is a standard EventSource. The [DONE] sentinel tells the client when to stop listening.
RAG Basics
You don’t need a vector database to start. MySQL full-text search or Meilisearch with a structured prompt covers most internal tools and knowledge bases:
$documents = Document::search($query)->take(5)->get();
$context = $documents
->map(fn ($d) => $d->content)
->implode("\n\n---\n\n");
$response = $claude->message([
'system' => 'Answer using only the provided context. If the context does not contain the answer, say so.',
'messages' => [
['role' => 'user', 'content' => "Context:\n{$context}\n\nQuestion:\n{$query}"],
],
'max_tokens' => 600,
]);
When this stops meeting your accuracy requirements, you add embeddings. The ClaudeClient doesn’t change — only the retrieval layer evolves.
[Architect’s Note] RAG failures almost always come from coupling retrieval logic with prompt construction. The moment you start building conditionals in prompt strings based on what the retriever found, you’ve lost the plot. Retrieval produces context. Prompt construction assembles it. These are separate responsibilities.
Multi-Model Architecture
Claude is not a universal solution. It excels at long-form reasoning, structured output, and safe handling of ambiguous prompts. For cheap high-volume classification or low-latency embeddings, you want something else.
interface AiModel
{
public function complete(array $input): AiResponse;
}
Implement ClaudeModel, EmbeddingModel, LightweightClassifier. Your application code depends on the interface. This lets you route by cost or latency, swap providers during an outage, and introduce new models without touching business logic.
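The routing decision itself can stay a pure function of the request's shape. A sketch of a cost-based picker — the thresholds and tier names are illustrative assumptions, not a real registry:

```php
// Sketch: choose a model tier from the estimated input size and the
// task's reasoning needs. Thresholds and tier names are placeholders.
function pickModelTier(int $estimatedInputTokens, bool $needsDeepReasoning): string
{
    if ($needsDeepReasoning || $estimatedInputTokens > 50_000) {
        return 'claude';       // long context, strongest reasoning
    }

    if ($estimatedInputTokens > 2_000) {
        return 'mid-tier';     // cheaper summarisation / extraction
    }

    return 'classifier';       // high-volume, low-latency tasks
}
```

A container binding can then map each tier name to an AiModel implementation, so business logic never names a provider directly.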
If you’re building a provider-agnostic system that spans Claude and GPT-4o, The Complete Guide to Integrating OpenAI with Laravel covers the abstraction patterns in detail.
When Not to Use Claude
Don’t use Claude for deterministic transformations, validation rules, or anything that must be 100% correct. AI output is a suggestion, not ground truth. The healthiest integrations treat it as input to further processing. If you find yourself trusting Claude’s output more than your domain code, that’s a design smell — and a support ticket waiting to happen.
What to Build Next
This guide took you from raw API authentication to production-grade token accounting, streaming, background jobs, and testing. The logical next steps depend on where your current bottleneck is.
Building your first interactive feature? Building a Claude Chatbot with Streaming in Laravel walks through a complete working chatbot with conversation memory and real-time streaming — a direct application of the patterns covered here.
Hitting rate limits under load? How to Handle Claude API Rate Limits in Production covers retry strategies, backpressure handling, and queue-based throttling in depth. The patterns map directly to Laravel’s job system.
Prefer server-driven UI over JavaScript? Laravel Livewire + Claude Integration shows how to build streaming AI features without touching a JavaScript framework.
Questions? Ask in our Developer Q&A or contact me directly.
This guide was featured on Laravel News in February 2026.
Frequently Asked Questions
- Do I need Anthropic’s official SDK or is raw HTTP sufficient for production?
Raw HTTP is the right call for PHP. Anthropic doesn’t maintain an official PHP SDK, and the third-party options available are either unmaintained or thin wrappers that add a dependency without adding meaningful value. Laravel’s HTTP client (backed by Guzzle) gives you everything you need — retry logic, timeout control, response handling — and keeps the integration entirely within tooling your team already understands and debugs. The one exception is streaming, where you read the raw PSR-7 response body directly, as covered in the streaming section above.
- How accurate is the 4-characters-per-token estimation?
Accurate enough to be a useful guardrail, not accurate enough to be a billing tool. English prose tends to run slightly under 4 characters per token. Code, JSON, and non-Latin scripts can skew significantly. For pre-flight checks — catching runaway document sizes before they hit the API — the heuristic is fine. For cost forecasting, use the actual input_tokens and output_tokens values from Claude’s response and track them in ai_usage_logs over time. Real usage data will always beat a formula.
- Should I use the system prompt or inject context directly into the user message?
Use system for persistent behavioural instructions — tone, constraints, persona, response format. Use user for request-specific context — the document, the question, the data being processed. Mixing them leads to prompts that are hard to maintain and harder to test. A practical rule: if the instruction would be the same across every request in a feature, it belongs in system. If it changes per request, it belongs in messages. This separation also makes caching more effective, since stable system prompts are candidates for Anthropic’s server-side prompt caching.
- How do I handle PHP memory limits when processing large documents?
Two approaches, used in combination. First, apply the TokenEstimator check before any Claude call and reject documents above your threshold — 12,000 input tokens is a reasonable first guardrail. Second, never load document content into memory inside an HTTP request. Large document processing belongs in a queued job where you control the worker’s memory ceiling via memory_limit in your queue worker configuration. If documents routinely exceed your threshold, the correct architectural response is chunking — split the document, summarise each chunk independently, then synthesise. That’s the natural path toward a RAG implementation.
- How do I isolate API keys in a multi-tenant application?
Don’t use a single Anthropic key for all tenants unless you’re comfortable with one tenant’s usage affecting another’s rate limits and all tenants sharing a single cost bucket. The cleaner model is per-tenant keys stored encrypted in your database, resolved at the service layer rather than from config. Your ClaudeClient can accept a key parameter rather than reading blindly from config('claude.api_key'). This also gives you per-tenant cost visibility and the ability to suspend a single tenant’s AI access without affecting others.
- How do I test streaming endpoints?
You can’t meaningfully test true HTTP streaming in a PHPUnit context — the response stream doesn’t behave the same way in a test environment. The correct approach is to test the two concerns separately. Test the streaming logic in ClaudeClient by providing a mock stream body and asserting that $onChunk is called with the expected text fragments. Test the controller endpoint separately with a feature test that asserts the response has the correct Content-Type: text/event-stream header and a non-empty body. Don’t try to assert streamed output token by token in an integration test — that way lies brittle, slow tests that break on whitespace changes.
- When should I move from summary-based Q&A to full RAG?
When users start reporting that answers are missing information they can see in the document. Summary-based Q&A compresses the document into a fixed representation — inevitably lossy. RAG retrieves relevant chunks at query time, so the right information is always available in context. The practical trigger for the migration is usually a combination of document length exceeding a few thousand words, users asking narrow factual questions rather than broad summary questions, and answer accuracy becoming a measurable problem. The good news is that the Claude service layer doesn’t change — only the retrieval mechanism evolves. Start with full-text search as described in the RAG section above, then introduce embeddings when search relevance becomes the bottleneck.
- How do I handle Claude timeouts gracefully without leaving the user hanging?
Two failure modes to design for. For synchronous requests: set a conservative HTTP timeout in config('claude.timeout'), catch the ConnectionException or TimeoutException, and return a clean 503 with a user-facing message. Don’t let the timeout bubble up as a 500. For queued jobs: set public int $timeout on the job class to exceed your worst-case Claude latency — 90 to 120 seconds is reasonable — and implement public function failed(\Throwable $e) to update the relevant model’s status so the UI can reflect the failure rather than showing a perpetual loading state.
- Can I cache conversational responses?
Technically yes, practically almost never. Conversation output is highly context-dependent — the same question asked at turn three of a conversation produces a different answer than at turn seven. Caching conversational output correctly requires keying on the full message history, which produces near-zero cache hit rates and pollutes your cache store with single-use entries. Reserve caching for deterministic, prompt-stable operations: document summaries, content classification, static content generation. If you’re looking to reduce costs on a conversational feature, focus on history truncation and summarisation strategies to control input token growth rather than caching the output.
- Is it safe to expose Claude responses directly to users, or should I sanitise output?
Always treat Claude output as untrusted input to your application layer. For UI rendering, run output through your templating engine’s escaping — never render raw Claude text as unescaped HTML. For structured data extraction — when you’re asking Claude to return JSON — always validate the parsed result against a schema before persisting or acting on it. Claude’s output is highly reliable but not guaranteed to match a structure precisely on every request, especially under unusual input conditions. Your application’s data integrity cannot depend on an LLM being consistent.
Senior Laravel Developer and AI Architect with 10+ years in the trenches. Dewald writes about building resilient, cost-aware AI integrations and modernizing the Laravel developer workflow for the 2026 ecosystem.