PRODUCTION-GRADE AI ARCHITECTURE | THE AGENTIC STACK (ADVANCED IMPLEMENTATION)

Building Agentic Laravel Apps with Prism PHP

Stack: Laravel 13, PHP 8.3+, PostgreSQL 16+ with pgvector, Prism PHP

Laravel 13 ships a capable first-party AI SDK (laravel/ai) covering text generation, embeddings, and basic completions across OpenAI, Gemini, and Anthropic. For most prompt-to-text workflows, it is the right starting point and the right long-term answer. Prism PHP is the layer you reach for when your requirements move past that baseline: multi-provider tool calling, agentic loop control with step limits, RAG pipelines with pgvector, and SSE streaming across providers that laravel/ai does not yet expose at full depth. These are complementary tools, not competing ones.

This guide covers the agentic side of that boundary. Everything here sits inside the broader Laravel AI architecture module, which frames how these implementation decisions relate to one another across the full stack.

laravel/ai vs Prism PHP: Knowing Which Tool to Reach For

Before adding Prism PHP to your dependency list, be clear on what you are solving for. The Laravel 13 AI features breakdown covers the first-party SDK in full. The short version:

Feature | laravel/ai | Prism PHP
Text generation | ✓ | ✓
Embeddings | ✓ | ✓
Tool calling | Limited | ✓ Full
Agentic loop control (withMaxSteps) | Not at the same depth | ✓
Multi-provider support | OpenAI, Gemini, Anthropic | + Ollama, Mistral
SSE streaming (unified across providers) | Provider-dependent | ✓
Broadcast via WebSocket (asBroadcast) | Not at the same depth | ✓

Single provider, no tool calling, straightforward prompt-to-text: use laravel/ai. Agents that call real APIs, route across providers, or stream to a frontend: Prism PHP closes those gaps. The contract-based abstraction approach that makes provider-swapping deterministic in production applies regardless of which layer sits underneath.

Installation

composer require prism-php/prism
php artisan vendor:publish --tag=prism-config

This drops a config/prism.php file. Source your provider API keys from .env and reference them via config. If you plan to register a shared Prism instance through Laravel’s Service Container, put the singleton binding in a service provider’s register() method and register that provider in bootstrap/app.php:

// bootstrap/app.php
->withProviders([
    App\Providers\AiServiceProvider::class,
])

Keep provider resolution out of controllers. Bind once, inject everywhere.

Multi-Provider Configuration

Hardcoding a single AI provider is a production risk. OpenAI has outages. Anthropic rate-limits at peak. A single config change should be enough to swap providers without touching application code.

OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-api03-...
GEMINI_API_KEY=...

Switching provider is one line:

Prism::text()->using('anthropic', 'claude-sonnet-4-6')
Prism::text()->using('openai', 'gpt-4o')
Prism::text()->using('gemini', 'gemini-2.0-flash')

[Architect’s Note] Build against a provider-agnostic interface from day one. Store the active provider and model string in config, not scattered across call sites. When you need to run a cost comparison between claude-sonnet-4-6 and gpt-4o in production, you will thank yourself for having one place to change. The Laravel AI service layer guide covers the interface pattern that makes this clean.
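One way to keep the provider and model in a single place is a small resolver that call sites ask by use case. This is a minimal sketch in plain PHP: the `roles` config shape, the role names, and the `resolveAiModel()` helper are all hypothetical and not part of Prism or laravel/ai.

```php
<?php

/**
 * Resolve the [provider, model] pair for a named use case from one config array.
 * The config shape and role names are illustrative assumptions.
 */
function resolveAiModel(array $config, string $role): array
{
    $roles = $config['roles'] ?? [];

    // Fall back to the default role when the requested one is not configured.
    $entry = $roles[$role] ?? $roles[$config['default_role'] ?? ''] ?? null;

    if ($entry === null) {
        throw new InvalidArgumentException("No AI model configured for role: {$role}");
    }

    return [$entry['provider'], $entry['model']];
}

// In a Laravel app this array would live in a config file and be read via config().
$aiConfig = [
    'default_role' => 'chat',
    'roles' => [
        'chat'    => ['provider' => 'anthropic', 'model' => 'claude-sonnet-4-6'],
        'summary' => ['provider' => 'openai', 'model' => 'gpt-4o'],
    ],
];

[$provider, $model] = resolveAiModel($aiConfig, 'summary');
// Call sites then pass the pair straight through: Prism::text()->using($provider, $model)
```

Running a cost comparison then means editing one config entry rather than hunting for model strings across controllers and jobs.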

Basic Text Generation

use Prism\Prism\Facades\Prism;

$response = Prism::text()
    ->using('anthropic', 'claude-sonnet-4-6')
    ->withMaxTokens(1024)
    ->withSystemPrompt('You are a luxury fashion consultant with access to our product catalogue.')
    ->withPrompt($userMessage)
    ->asText();

return $response->text;

asText() blocks until the full response arrives. Use it for short, server-side generation tasks. Stream anything user-facing.

[Production Pitfall] Not setting withMaxTokens() means Prism defers to the provider default, which for some models is the full context window. On a busy application, a handful of runaway completions can consume a significant portion of your monthly budget in minutes. Set an explicit ceiling on every call and monitor token consumption via your middleware layer.
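To see why the ceiling matters, it helps to put a number on the worst case. The sketch below is simple arithmetic, not a billing tool: the price per million output tokens and the 64,000-token uncapped figure are placeholder assumptions, so substitute your provider's published rates and limits.

```php
<?php

/**
 * Upper bound on output spend if every request runs to its token ceiling.
 * The per-million price passed in is a placeholder, not a quoted provider rate.
 */
function worstCaseOutputCostUsd(int $maxTokens, int $requests, float $usdPerMillionTokens): float
{
    return $maxTokens * $requests * $usdPerMillionTokens / 1_000_000;
}

// Same traffic, same assumed price; only the ceiling differs.
$capped   = worstCaseOutputCostUsd(1024, 10_000, 15.0);
$uncapped = worstCaseOutputCostUsd(64_000, 10_000, 15.0);
```

Under these assumed numbers the uncapped worst case is 62.5 times the capped one, which is the gap an explicit withMaxTokens() call closes.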

Building a Tool-Calling Agent with Prism PHP

Agents in Prism are text requests that include tools. The model decides when to invoke a tool, Prism executes the closure, injects the result back into the conversation, and the loop continues until the model reaches a final answer or withMaxSteps() fires. The diagram below shows the full cycle.

[Figure: Prism PHP agentic loop. The user prompt enters Prism PHP (request orchestrator), which calls the LLM (claude-sonnet-4-6). The LLM decides whether to invoke a tool. If yes, the tool (Stripe / Twilio) executes and its result loops back to the LLM, repeating up to five times. If no tool call is needed, the final response is returned to the user.]
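The cycle can be sketched in plain PHP with a fake model standing in for the LLM. This is an illustration of the control flow only, not Prism's internal implementation: `runAgentLoop()`, the decision array shape, and the `lookup_order` tool are all hypothetical.

```php
<?php

/**
 * Minimal stand-in for a tool-calling loop. $model is a fake LLM: given the
 * message history it returns either ['tool' => name, 'args' => [...]] or
 * ['final' => text].
 */
function runAgentLoop(callable $model, array $tools, string $prompt, int $maxSteps): array
{
    $messages = [['role' => 'user', 'content' => $prompt]];

    for ($step = 1; $step <= $maxSteps; $step++) {
        $decision = $model($messages);

        if (isset($decision['final'])) {
            return ['text' => $decision['final'], 'steps' => $step, 'stopped' => 'final'];
        }

        // Execute the requested tool and feed its result back to the model.
        $tool = $tools[$decision['tool']] ?? null;
        $result = $tool ? $tool(...array_values($decision['args'])) : 'Error: unknown tool.';
        $messages[] = ['role' => 'tool', 'content' => $result];
    }

    // Step cap reached: stop instead of looping forever.
    return ['text' => '', 'steps' => $maxSteps, 'stopped' => 'max_steps'];
}

$tools = [
    'lookup_order' => fn (string $id): string => "Order {$id} shipped.",
];

// Fake model: asks for one tool call, then answers using the tool result.
$fakeModel = function (array $messages): array {
    $last = end($messages);

    return $last['role'] === 'tool'
        ? ['final' => "Update: {$last['content']}"]
        : ['tool' => 'lookup_order', 'args' => ['id' => 'A17']];
};

$run = runAgentLoop($fakeModel, $tools, 'Where is order A17?', 5);

// A model that never stops calling tools is cut off at the cap.
$alwaysCallsTool = fn () => ['tool' => 'lookup_order', 'args' => ['id' => 'A17']];
$runaway = runAgentLoop($alwaysCallsTool, $tools, 'Loop', 3);
```

The second call shows the failure mode the step cap exists for: without the bound on `$maxSteps`, the loop would never return.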

The Stripe Refund Tool

use Prism\Prism\Facades\Tool;
use Stripe\StripeClient;

$refundTool = Tool::as('process_refund')
    ->for('Refund the most recent charge for a customer by their email address.')
    ->withStringParameter('email', 'The customer email address.')
    ->using(function (string $email): string {
        try {
            $stripe = new StripeClient(config('services.stripe.secret'));

            $customer = $stripe->customers->all(['email' => $email, 'limit' => 1])->data[0] ?? null;
            if (!$customer) {
                return 'Error: Customer not found.';
            }

            $charge = $stripe->charges->all(['customer' => $customer->id, 'limit' => 1])->data[0] ?? null;
            if (!$charge) {
                return 'Error: No charges found for this account.';
            }

            $stripe->refunds->create(['charge' => $charge->id]);

            return "Success: Refunded charge {$charge->id} for {$email}.";

        } catch (\Stripe\Exception\ApiErrorException $e) {
            \Log::error('Stripe refund failed', ['email' => $email, 'error' => $e->getMessage()]);
            return 'Error: Payment processor unavailable. Please try again later.';
        }
    });

[Production Pitfall] This tool executes a live financial transaction. As written, it fires without any human confirmation step. Never let an LLM trigger a refund autonomously in production. Wire in a confirmation queue, a webhook callback, or at minimum a human-approval flag before this runs against real Stripe credentials.
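One possible shape for that confirmation step is to have the tool record a pending request and return, leaving the actual Stripe call to a human-triggered action. The sketch below uses an in-memory class so it runs on its own; `RefundApprovalQueue` and its methods are hypothetical, and a real application would persist the requests.

```php
<?php

/**
 * In-memory stand-in for a confirmation queue. A real app would persist these
 * rows and execute the Stripe refund only after a human approves.
 */
final class RefundApprovalQueue
{
    /** @var array<int, array{email: string, status: string}> */
    private array $requests = [];

    public function request(string $email): int
    {
        $this->requests[] = ['email' => $email, 'status' => 'pending'];

        return array_key_last($this->requests);
    }

    public function approve(int $id): void
    {
        if (!isset($this->requests[$id])) {
            throw new OutOfBoundsException("Unknown refund request: {$id}");
        }

        $this->requests[$id]['status'] = 'approved';
    }

    public function status(int $id): string
    {
        return $this->requests[$id]['status'];
    }
}

$queue = new RefundApprovalQueue();

// This closure is what the tool would run: it records intent and returns.
$requestRefund = function (string $email) use ($queue): string {
    $id = $queue->request($email);

    return "Refund request #{$id} for {$email} is awaiting human approval.";
};

$reply = $requestRefund('jane@example.com');
```

The model still gets a truthful tool result to relay to the customer, but no money moves until `approve()` is called from outside the agent loop.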

The Twilio SMS Tool

use Prism\Prism\Facades\Tool;
use Twilio\Rest\Client;

$smsTool = Tool::as('send_sms')
    ->for('Send an SMS notification to a phone number.')
    ->withStringParameter('to', 'Recipient phone number in E.164 format.')
    ->withStringParameter('message', 'The message body.')
    ->using(function (string $to, string $message): string {
        try {
            $twilio = new Client(
                config('services.twilio.sid'),
                config('services.twilio.token')
            );

            $twilio->messages->create($to, [
                'from' => config('services.twilio.from'),
                'body' => $message,
            ]);

            return "SMS sent to {$to}.";

        } catch (\Twilio\Exceptions\RestException $e) {
            \Log::error('Twilio SMS failed', ['to' => $to, 'error' => $e->getMessage()]);
            return 'Error: Could not deliver SMS. Notification skipped.';
        }
    });

Running the Agent

$response = Prism::text()
    ->using('anthropic', 'claude-sonnet-4-6')
    ->withMaxTokens(2048)
    ->withSystemPrompt('You are a customer support agent. You can process refunds and send SMS notifications.')
    ->withTools([$refundTool, $smsTool])
    ->withMaxSteps(5)
    ->withPrompt($userMessage)
    ->asText();

return $response->text;

withMaxSteps(5) is your safety valve. It caps the agent loop at five tool-call rounds before Prism forces a final answer. For most support workflows, that is more than enough. At step counts above ten, production teams consistently see token costs spike when the model enters retry loops on tool failures. Bump it only with a verified reason.

RAG with pgvector

Standard LLMs hallucinate on data they were not trained on. RAG mitigates this by pulling semantically relevant chunks from your own database and injecting them into the prompt before the model sees the question. The conceptual foundation (how embeddings work and why pgvector is the right store for production RAG) is in the Laravel embeddings and vector database guide. For direct Anthropic API integration without the Prism abstraction layer, including raw HTTP, streaming, and token accounting, see the Laravel Claude API integration guide.

Step 1: Enable the Extension

-- Run once against your database
CREATE EXTENSION IF NOT EXISTS vector;

Step 2: Migration

OpenAI’s text-embedding-3-small outputs 1,536 dimensions. Your vector column must match exactly. Dimensions are not interchangeable across embedding models.

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;
use Illuminate\Support\Facades\DB;

return new class extends Migration
{
    public function up(): void
    {
        DB::statement('CREATE EXTENSION IF NOT EXISTS vector');

        Schema::create('products', function (Blueprint $table) {
            $table->id();
            $table->string('name');
            $table->text('description');
            $table->string('category');
            $table->decimal('price', 10, 2);
            $table->timestamps();
        });

        // Blueprint does not support vector columns natively; add with a raw statement
        DB::statement('ALTER TABLE products ADD COLUMN embedding vector(1536)');

        // HNSW index for fast approximate nearest-neighbour search
        DB::statement(
            'CREATE INDEX products_embedding_idx ON products USING hnsw (embedding vector_cosine_ops)'
        );
    }
};

Step 3: Generating and Storing Embeddings

Wrap embedding generation in a dedicated service class. This makes it injectable, mockable, and swappable when you change embedding providers.

// app/Services/EmbeddingService.php

namespace App\Services;

use Illuminate\Support\Facades\Http;
use RuntimeException;

class EmbeddingService
{
    public function generate(string $text): array
    {
        $response = Http::withToken(config('services.openai.api_key'))
            ->timeout(30)
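            // throw: false returns the failed response after the last attempt,
            // so the failed() check below is reached instead of a RequestException
            ->retry(3, 1000, throw: false)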
            ->post('https://api.openai.com/v1/embeddings', [
                'model' => 'text-embedding-3-small',
                'input' => $text,
            ]);

        if ($response->failed()) {
            throw new RuntimeException(
                'Embedding generation failed: ' . $response->status()
            );
        }

        $embedding = $response->json('data.0.embedding');

        if (!is_array($embedding) || empty($embedding)) {
            throw new RuntimeException('Invalid embedding response from OpenAI.');
        }

        return $embedding;
    }
}

[Efficiency Gain] Embedding generation costs tokens and adds latency. Always dispatch it as a queued job, never inline during a web request. If you re-embed the same product after a minor edit, cache the embedding hash and skip the API call when the source text has not changed.
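The skip-when-unchanged check can be as small as a hash comparison. This is a sketch under assumptions: the two helper functions are hypothetical, and it presumes you store the hash next to the embedding (for example in an `embedding_hash` column that the migration above does not define).

```php
<?php

/** Normalise and hash the text that feeds the embedding. */
function embeddingSourceHash(string $text): string
{
    // Collapse whitespace so purely cosmetic edits do not trigger a re-embed.
    $normalised = trim(preg_replace('/\s+/', ' ', $text));

    return hash('sha256', $normalised);
}

/** True when the stored hash is missing or no longer matches the text. */
function needsReembedding(string $text, ?string $storedHash): bool
{
    return $storedHash === null
        || !hash_equals($storedHash, embeddingSourceHash($text));
}

$stored = embeddingSourceHash('Storm Jacket Waterproof shell outerwear');

// Whitespace-only change: the API call can be skipped.
$unchanged = needsReembedding("Storm Jacket  Waterproof shell\nouterwear", $stored);

// Wording change: the product needs a fresh embedding.
$changed = needsReembedding('Storm Jacket Windproof shell outerwear', $stored);
```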

Dispatch the job rather than calling the service directly from your controller:

// Controller or action class
$product = Product::create([
    'name'        => $name,
    'description' => $description,
    'category'    => $category,
    'price'       => $price,
]);

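GenerateProductEmbeddingJob::dispatch($product);

// app/Jobs/GenerateProductEmbeddingJob.php

namespace App\Jobs;

use App\Models\Product; // assumed Eloquent model for the products table
use App\Services\EmbeddingService;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Queue\Queueable;
use Illuminate\Support\Facades\DB;

class GenerateProductEmbeddingJob implements ShouldQueue
{
    use Queueable;

    public int $tries = 3;    // retry transient provider failures up to three times
    public int $backoff = 60; // wait 60 seconds between attempts

    public function __construct(public Product $product) {}

    public function handle(EmbeddingService $embeddingService): void
    {
        $text = "{$this->product->name} {$this->product->description} {$this->product->category}";

        $vector = $embeddingService->generate($text);

        // pgvector accepts the '[v1,v2,...]' text form cast to vector
        DB::statement(
            'UPDATE products SET embedding = ?::vector WHERE id = ?',
            ['[' . implode(',', $vector) . ']', $this->product->id]
        );
    }
}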
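The job above builds the vector literal with a bare implode, which trusts the provider to return exactly the right number of values. A small guard can fail early with a readable message instead of a database error. The helper name is hypothetical, and the example uses three dimensions only to stay short; in this guide the expected size would be 1536.

```php
<?php

/**
 * Convert a numeric array to pgvector's '[v1,v2,...]' text form, refusing
 * vectors whose length does not match the column definition.
 */
function toPgVectorLiteral(array $vector, int $expectedDimensions): string
{
    if (count($vector) !== $expectedDimensions) {
        throw new InvalidArgumentException(sprintf(
            'Expected %d dimensions, got %d.',
            $expectedDimensions,
            count($vector)
        ));
    }

    foreach ($vector as $value) {
        // Reject strings, nulls, NAN, and INF before they reach the database.
        if ((!is_int($value) && !is_float($value)) || !is_finite((float) $value)) {
            throw new InvalidArgumentException('Embedding values must be finite numbers.');
        }
    }

    return '[' . implode(',', $vector) . ']';
}

$literal = toPgVectorLiteral([0.5, -0.25, 1.0], 3);
```

The same helper can serve both the write path in the job and the query path in the similarity search, so the two cannot drift apart.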

Step 4: Similarity Search

$queryVector = $embeddingService->generate("What are your best waterproof jackets?");
$vectorString = '[' . implode(',', $queryVector) . ']';

$products = DB::select(
    "SELECT id, name, description, price,
            embedding <=> ?::vector AS distance
     FROM products
     ORDER BY distance
     LIMIT 3",
    [$vectorString]
);

[Edge Case Alert] The <=> operator computes cosine distance: lower values are closer. If similarity queries are unexpectedly slow, verify that your HNSW index uses vector_cosine_ops, not vector_l2_ops. When the query operator and the index operator class do not match, PostgreSQL cannot use the index and silently falls back to a sequential scan, which destroys query performance on large tables without surfacing an error.
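For intuition about the numbers that come back in the distance column, cosine distance is one minus cosine similarity. The plain-PHP function below reproduces that definition for small vectors; it is an illustration for sanity-checking values, not something to run in place of the database query.

```php
<?php

/** Cosine distance: 1 - (a . b) / (|a| * |b|). Lower means more similar. */
function cosineDistance(array $a, array $b): float
{
    $a = array_values($a);
    $b = array_values($b);

    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length.');
    }

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;

    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }

    if ($normA == 0.0 || $normB == 0.0) {
        throw new InvalidArgumentException('Cosine distance is undefined for a zero vector.');
    }

    return 1.0 - $dot / (sqrt($normA) * sqrt($normB));
}

$same       = cosineDistance([1, 2, 3], [1, 2, 3]);   // same direction
$orthogonal = cosineDistance([1, 0], [0, 1]);         // unrelated
$opposite   = cosineDistance([1, 0], [-1, 0]);        // opposite direction
```

The range runs from 0 (same direction) through 1 (orthogonal) to 2 (opposite), which is why ORDER BY distance ascending returns the closest matches first.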

Step 5: Inject Context into the Prompt

$context = collect($products)
    ->map(fn($p) => "{$p->name}: {$p->description} | R{$p->price}")
    ->implode("\n");

$response = Prism::text()
    ->using('anthropic', 'claude-sonnet-4-6')
    ->withMaxTokens(1024)
    ->withSystemPrompt('You are a product assistant. Use only the provided product context to answer questions. Do not speculate beyond it.')
    ->withPrompt("Context:\n{$context}\n\nQuestion: What are your best waterproof jackets?")
    ->asText();

Constraining the model to the provided context is not just a prompt best practice. It is the operational difference between a useful RAG system and a confidently wrong one.
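A related guard is to decline to build a prompt at all when retrieval returns nothing close enough. The sketch below assumes rows as associative arrays (DB::select returns objects, so a real version would read properties instead), and the 0.6 distance cut-off is an illustrative guess that needs tuning against your own data.

```php
<?php

/**
 * Build the grounded prompt from retrieved rows, or return null when nothing
 * is close enough to be useful. The default cut-off is an assumption.
 */
function buildGroundedPrompt(array $rows, string $question, float $maxDistance = 0.6): ?string
{
    $relevant = array_filter(
        $rows,
        fn (array $row): bool => $row['distance'] <= $maxDistance
    );

    if ($relevant === []) {
        return null; // caller should answer "I don't know" without calling the model
    }

    $context = implode("\n", array_map(
        fn (array $row): string => "{$row['name']}: {$row['description']} | R{$row['price']}",
        $relevant
    ));

    return "Context:\n{$context}\n\nQuestion: {$question}";
}

$rows = [
    ['name' => 'Storm Jacket', 'description' => 'Waterproof shell', 'price' => '1499.00', 'distance' => 0.2],
    ['name' => 'Linen Shirt', 'description' => 'Summer weight', 'price' => '599.00', 'distance' => 0.9],
];

$prompt = buildGroundedPrompt($rows, 'Best jacket?');
$nothingRelevant = buildGroundedPrompt([$rows[1]], 'Best jacket?');
```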

Real-Time Streaming with SSE

asEventStreamResponse() returns a streamed HTTP response over Server-Sent Events. Before choosing SSE, read the breakdown of Livewire, SSE, and WebSockets for AI streaming transports. SSE is the right default for one-way AI output streams, but not every use case fits that pattern.

Backend

use Prism\Prism\Facades\Prism;
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Route;

Route::get('/chat/stream', function (Request $request) {
    $validated = $request->validate([
        'message' => 'required|string|max:2000',
    ]);

    return Prism::text()
        ->using('anthropic', 'claude-sonnet-4-6')
        ->withMaxTokens(1024)
        ->withSystemPrompt('You are a helpful product assistant.')
        ->withPrompt($validated['message'])
        ->asEventStreamResponse();
});

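The validate() call here is non-optional in production. Raw input to a prompt is an injection vector. Validation bounds the input's type and length; it does not neutralise prompt injection, so keep treating the message as untrusted text.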

Frontend

The scoping issue in a naive implementation: this.output referenced inside a plain <script> block does not bind to the Alpine component. It points to window and silently swallows your streamed content. The corrected version below wraps everything in the Alpine component factory.

<div x-data="chatStream()">
    <textarea x-model="output" class="w-full h-64 p-4 bg-gray-100 rounded" readonly></textarea>
    <input x-model="message" type="text" placeholder="Ask something..."
           class="w-full p-2 border rounded mt-2" />
    <button @click="startStream()"
            class="mt-2 px-4 py-2 bg-blue-600 text-white rounded">Ask</button>
</div>

<script>
    function chatStream() {
        return {
            output: '',
            message: '',
            source: null,
            startStream() {
                if (this.source) this.source.close();
                this.output = '';
                this.source = new EventSource(
                    `/chat/stream?message=${encodeURIComponent(this.message)}`
                );
                this.source.onmessage = (event) => {
                    if (event.data === '[DONE]') {
                        this.source.close();
                        return;
                    }
                    try {
                        const parsed = JSON.parse(event.data);
                        this.output += parsed.text ?? '';
                    } catch (e) {
                        console.error('SSE parse error:', e);
                    }
                };
                this.source.onerror = () => {
                    this.source.close();
                };
            }
        };
    }
</script>

For pushing AI output to a specific authenticated user rather than a public GET endpoint, Prism also supports asBroadcast() via Laravel Reverb. That is the pattern to reach for when you need session-scoped streaming.

Caching Repeated Prompts

AI API calls are expensive. If you are generating the same output repeatedly (summarising a static policy document, answering a common FAQ), cache the result.

$policy = cache()->remember('ai.return_policy.v1', now()->addDay(), function () {
    return Prism::text()
        ->using('anthropic', 'claude-sonnet-4-6')
        ->withMaxTokens(512)
        ->withPrompt('Summarise our return policy in three sentences.')
        ->asText()
        ->text;
});

[Word to the Wise] Cache keying strategy matters more than most developers realise until they have served stale AI output to the wrong user. Only cache responses generated from prompts with no user-specific context. The moment you interpolate a user ID, a session variable, or any personalised data into the prompt, that response must never be shared across users. Name and version your cache keys clearly. ai.return_policy.v1 is a better key than policy because it lets you bust the cache cleanly when your underlying document changes.
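One way to make that convention hard to get wrong is to derive the key from everything that affects the output. This is a sketch: `aiCacheKey()` is a hypothetical helper, and the 16-character fingerprint length is an arbitrary choice.

```php
<?php

/**
 * Build a versioned cache key for a user-independent prompt. Hashing the
 * provider, model, and prompt means any change to them yields a new key.
 */
function aiCacheKey(
    string $namespace,
    string $version,
    string $provider,
    string $model,
    string $prompt
): string {
    $fingerprint = substr(hash('sha256', "{$provider}|{$model}|{$prompt}"), 0, 16);

    return "ai.{$namespace}.{$version}.{$fingerprint}";
}

$prompt = 'Summarise our return policy in three sentences.';

$key = aiCacheKey('return_policy', 'v1', 'anthropic', 'claude-sonnet-4-6', $prompt);

// Switching model produces a different key, so old output is never reused.
$otherModelKey = aiCacheKey('return_policy', 'v1', 'openai', 'gpt-4o', $prompt);
```

The version segment still has a job: bump it when the underlying document changes but the prompt text does not.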

Operational Metrics You Should Track in Production

AI systems fail operationally long before they fail technically. Monitoring latency, tool reliability, token usage, and streaming performance becomes essential once AI workloads move into production.

Metric | What It Measures | Suggested Target
P50 Latency | Median end-to-end AI request latency | < 2.5s
P95 Latency | Tail latency under peak load | < 7s
Time To First Token (TTFT) | How quickly streaming responses begin | < 1.2s
Tool Call Success Rate | Percentage of valid tool executions | > 95%
Tokens Per Request | Input + output token consumption | Monitor Trends
Cache Hit Rate | Requests served from cache | > 30%
Cost Per 1K Requests | Normalized AI infrastructure cost | Continuously Optimize

Production AI systems should be treated like distributed infrastructure, not simple API integrations. Reliability, latency, observability, and cost management become architectural concerns very quickly.
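The two latency rows in the table are percentiles over recorded request durations. A minimal sketch of the calculation, using the nearest-rank method and made-up sample data, looks like this; in production you would normally read these from your metrics backend rather than compute them in PHP.

```php
<?php

/** Nearest-rank percentile over latency samples in milliseconds. */
function percentile(array $samples, float $p): float
{
    if ($samples === []) {
        throw new InvalidArgumentException('At least one sample is required.');
    }

    sort($samples);

    // Nearest rank: the smallest value with at least p percent of samples at or below it.
    $rank = (int) ceil(($p / 100) * count($samples));

    return (float) $samples[max(0, $rank - 1)];
}

// Illustrative request latencies in milliseconds (not real measurements).
$latenciesMs = [900, 1200, 1500, 1800, 2100, 2400, 3000, 4200, 5600, 8000];

$p50 = percentile($latenciesMs, 50);
$p95 = percentile($latenciesMs, 95);

// Compare against the suggested targets from the table.
$p50WithinTarget = $p50 < 2500;
$p95WithinTarget = $p95 < 7000;
```

With this sample the median is comfortable while the tail misses its target, which is the pattern that averages hide and percentiles expose.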


References and Further Reading

  • Prism PHP Documentation – Official docs covering all providers, tool schemas, and streaming options.
  • OpenAI Embeddings API Reference – Model dimensions, pricing, and batching guidance for text-embedding-3-small and text-embedding-3-large.
  • Laravel Reverb – Laravel’s first-party WebSocket server, relevant if you move beyond SSE to bidirectional AI communication.

Frequently Asked Questions

When should I use laravel/ai instead of Prism PHP?

Use laravel/ai when your requirements are text generation, embeddings, or basic prompt-to-text completions with a single provider. It is the first-party, officially supported choice and the right long-term default for those cases. Reach for Prism PHP when you need full tool calling, agentic loop control, cross-provider routing, or asBroadcast() streaming — capabilities laravel/ai does not yet expose at the same depth.

What does withMaxSteps() actually protect against?

It caps the number of tool-call rounds Prism PHP will execute before forcing the model to return a final answer. Without it, a model that enters a retry loop on a failing tool (or that chains unnecessary tool calls) can consume your entire token budget on a single request. Five steps is a safe default for most support-style workflows.

Why use an HNSW index with vector_cosine_ops rather than IVFFlat?

pgvector offers two approximate index types. IVFFlat builds its lists from the rows present when the index is created, so recall can degrade as new data arrives until you rebuild it. HNSW supports incremental inserts cleanly, which matches how product data actually changes. The vector_cosine_ops operator class must match the <=> cosine distance operator used in your queries. A mismatch silently causes a sequential scan, which performs acceptably on small tables and catastrophically on large ones.

Can I use Prism PHP agents in a queued job?

Yes, and for anything with side effects (Stripe or Twilio calls, database writes), that is where they belong. Queued jobs give you retry logic, failure logging, and isolation from the HTTP request lifecycle. The $tries and $backoff properties on your job class handle transient provider errors without custom retry code. Because a retry re-runs the whole agent, make side-effecting tools idempotent (Stripe accepts an idempotency key per request) so a retried job cannot refund twice.

What happens if the SSE connection drops mid-stream?

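The browser’s EventSource API automatically attempts to reconnect, and it sends a Last-Event-ID header only when the server has emitted id: fields. For production use, your backend should emit an id: field on each event and honour that header so the client can resume. Without them, a reconnect starts the stream from scratch. Note that the frontend example above closes the connection in onerror, which disables automatic reconnection; remove that call if you want resumption. The production SSE guide covers reconnect handling, timeouts, and multi-tenant event isolation in depth.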
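If you frame the stream yourself, the resumable part comes down to two pieces: writing an id: line with each event and skipping already-delivered chunks on reconnect. The sketch below is plain PHP with hypothetical helper names, and it assumes the generated chunks are buffered somewhere keyed by event id; it does not describe what asEventStreamResponse() emits.

```php
<?php

/** Frame one Server-Sent Event with an id so the browser can resume from it. */
function formatSseEvent(int $id, array $payload): string
{
    // json_encode escapes newlines, so the payload always fits on one data: line.
    return "id: {$id}\n"
        . 'data: ' . json_encode($payload, JSON_THROW_ON_ERROR) . "\n\n";
}

/** Return only the chunks the client has not yet seen, keyed by event id. */
function chunksAfter(array $chunks, ?string $lastEventId): array
{
    $lastSeen = is_numeric($lastEventId) ? (int) $lastEventId : 0;

    return array_filter(
        $chunks,
        fn (int $id): bool => $id > $lastSeen,
        ARRAY_FILTER_USE_KEY
    );
}

// Buffered chunks for one response, keyed by event id (illustrative data).
$chunks = [1 => 'Our ', 2 => 'return ', 3 => 'policy'];

$frame = formatSseEvent(1, ['text' => $chunks[1]]);

// On reconnect the browser sends Last-Event-ID: 2, so only chunk 3 is replayed.
$resumed = chunksAfter($chunks, '2');
```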

Dewald Hugo

A software architect with 15+ years of experience in the PHP and Laravel ecosystem. Dewald created Origin Main to provide the engineering rigour required to integrate AI into professional, high-concurrency production systems. He writes for developers who care less about "getting it to work" and more about "getting it to last".
