Laravel AI Agent Memory: Persisting Context Across Conversations and Sessions

🕒 2 Minute Read 📅 Date Published: June 12, 2026

Every production Laravel AI agent memory system fails the same way the first time: the agent re-asks a question the user answered three turns ago. Not because the model forgot. Because the architecture never told it.

Stateless completion endpoints are the default in most Laravel AI tutorials. Send a prompt, get a response, discard the context. For single-turn tasks that is fine. For agentic workflows where the agent executes tools, accumulates intermediate results, and serves returning users, stateless design is an architectural failure, not a simplification.

Agent memory is not a single pattern. It is three distinct layers, each with different storage backends, different TTLs, and different injection points. If you are building anything beyond a throwaway chatbot demo, the full Laravel AI architecture covers how memory fits into a broader production system. This article builds all three layers from scratch using Eloquent, Redis, and pgvector.

Why Stateless Agents Break in Production

The failure modes are concrete and repeatable.

An agent without conversation memory will re-execute a tool call it already completed if the user asks a follow-up question. The prior result is gone. The agent has no choice but to run the tool again, burning time, credits, or both. If the tool has side effects (sending a notification, writing a record), you now have a duplicate.

An agent without session memory will ask the same clarifying questions every time the conversation resumes after a page refresh or reconnection. “Which project are you working on?” The user answered that four messages ago. The context window did not survive the HTTP boundary.

An agent without long-term memory treats every returning user as a stranger. Preferences, past decisions, domain context: all gone. A user who spent three sessions teaching the agent about their non-standard data model will have to do it again.

None of these are prompting problems. They are architectural gaps. Building agentic applications with Prism PHP covers the tool-calling layer; this article addresses what that layer needs underneath it to operate coherently over time.

The Three Memory Layers

Before writing any code, establish the taxonomy. Every subsequent decision flows from it.

Layer 1: In-context memory (conversation history). The messages array passed to the model on each request. Backed by Eloquent for persistence across HTTP boundaries, loaded into the Prism message chain on each turn. Ephemeral from the model’s perspective; durable from the database’s.

Layer 2: Session memory (working context). Distilled facts the agent accumulates during a working session. Not the full message history: structured state. “The user is working on project 42.” “The refund was approved.” Backed by Redis with a session-scoped TTL. Injected into the system prompt as a context block before each turn.

Layer 3: Long-term memory (semantic recall). Persistent facts that should influence future sessions. “This user always prefers JSON output.” “This client’s codebase uses a non-standard base model.” Backed by pgvector. Retrieved semantically before each session rather than loaded exhaustively.

The diagram below shows how each layer connects to its storage backend and where it enters the prompt:

Each layer is independent. You can implement Layer 1 alone and have a functional chatbot. Adding Layer 2 gives the agent working memory. Layer 3 turns it into something that learns.

Layer 1: Conversation History with Eloquent

This is what most developers implement first, and the layer most likely to cause production problems if left unconstrained.

Start with the migration:

Schema::create('agent_conversations', function (Blueprint $table) {
    $table->id();
    $table->foreignId('user_id')->constrained()->cascadeOnDelete();
    $table->string('session_id')->index();
    $table->enum('role', ['user', 'assistant', 'tool']);
    $table->text('content');
    $table->string('tool_name')->nullable();
    $table->json('tool_result')->nullable();
    $table->unsignedInteger('token_count')->default(0);
    $table->timestamps();
});

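The token_count column is intentional. Populate it after each Prism call with the turn’s token usage from $response->usage->completionTokens. That figure covers only the assistant’s output, so user turns stay at the default of 0 unless you estimate them separately. You will need the column for budget enforcement later.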

The ConversationMemoryService must return Prism Message objects, not raw arrays. withMessages() accepts UserMessage and AssistantMessage instances. Passing plain associative arrays will fail silently in some builds and throw in others. Map explicitly:

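use App\Models\AgentConversation; // assumes the default App\Models namespace
use Prism\Prism\ValueObjects\Messages\UserMessage;
use Prism\Prism\ValueObjects\Messages\AssistantMessage;

class ConversationMemoryService
{
    public function loadHistory(string $sessionId, int $limit = 20): array
    {
        return AgentConversation::where('session_id', $sessionId)
            ->whereIn('role', ['user', 'assistant'])
            // Order by id, not created_at: both turns of an exchange are written
            // within the same second, so timestamps alone can tie and reorder them.
            ->orderByDesc('id')
            ->limit($limit)
            ->get()
            ->reverse()
            ->map(fn ($msg) => match ($msg->role) {
                'user'      => new UserMessage($msg->content),
                'assistant' => new AssistantMessage($msg->content),
            })
            ->values()
            // all() returns the Message objects untouched; toArray() would convert
            // any item that implements Arrayable into a plain array.
            ->all();
    }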

    public function append(
        string $sessionId,
        string $role,
        string $content,
        int $userId,
        int $tokenCount = 0
    ): void {
        AgentConversation::create([
            'session_id'  => $sessionId,
            'user_id'     => $userId,
            'role'        => $role,
            'content'     => $content,
            'token_count' => $tokenCount,
        ]);
    }
}

The orderByDesc with limit followed by reverse() is deliberate: fetch the most recent N turns, then restore chronological order for the model. Fetching ASC with no limit is the default that breaks first in production.

[Production Pitfall] Loading unbounded conversation history into the context window is the single most common production failure in Laravel chatbot implementations. A session that runs 200 turns in a support workflow can exhaust even a 200k-token context window once you account for the system prompt, tool schemas, and response tokens. The sliding window above is the minimum viable safeguard. For long-running sessions, add a summarisation step: compress turns older than the window into a single AssistantMessage summary block, then prepend it to the loaded history. The token budget and inference parameter controls covered in the inference control guide apply directly here.
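A turn count is only a proxy for what actually runs out, which is tokens. The sketch below trims by token budget instead, using the token_count values stored per row. It is an illustration rather than part of the service above: trimToTokenBudget() is a hypothetical helper, the turns are plain arrays rather than Eloquent models, and the 4-characters-per-token fallback for unmeasured rows is a rough assumption, not a tokenizer.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: keep the newest turns that fit inside a token budget.
 *
 * $turns is oldest-first; each turn is
 * ['role' => string, 'content' => string, 'token_count' => int].
 * Rows with token_count = 0 (never measured) fall back to a rough
 * 4-characters-per-token estimate. That ratio is an assumption.
 */
function trimToTokenBudget(array $turns, int $budget): array
{
    $kept = [];
    $used = 0;

    // Walk from newest to oldest so the most recent context survives.
    foreach (array_reverse($turns) as $turn) {
        $cost = $turn['token_count'] > 0
            ? $turn['token_count']
            : (int) ceil(strlen($turn['content']) / 4);

        if ($used + $cost > $budget) {
            break; // Stop at the first turn that would overflow the budget.
        }

        $used  += $cost;
        $kept[] = $turn;
    }

    // Restore chronological order for the model.
    return array_reverse($kept);
}
```

Stopping at the first turn that does not fit, rather than skipping it and continuing, keeps the retained history contiguous so the model never sees a conversation with a hole in the middle.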

Tool turns (role = 'tool') are excluded from the history load. ToolResultMessage has a more complex structure requiring tool call IDs from the originating AssistantMessage. If your agent uses multi-step tool calling, persist and reconstruct those pairs separately. The human-in-the-loop approval workflow shows how tool call state survives across request boundaries in a queue-backed architecture.

Layer 2: Session Memory with Redis

Conversation history gives the agent the transcript. Session memory gives it the distilled facts. These are different things. “The user said X at 14:32” is history. “The user is working on project 42” is working context. Injecting both as raw history is wasteful and noisy. Keep them separate.

use Illuminate\Support\Facades\Cache;

class SessionMemoryService
{
    private string $prefix = 'agent_session:';

    public function remember(
        string $sessionId,
        string $key,
        mixed $value,
        int $ttlMinutes = 60
    ): void {
        // Read-modify-write is not atomic: concurrent requests for one session can overwrite each other
        $memory       = $this->recall($sessionId);
        $memory[$key] = $value;

        Cache::put(
            $this->prefix . $sessionId,
            $memory,
            now()->addMinutes($ttlMinutes)
        );
    }

    public function recall(string $sessionId): array
    {
        return Cache::get($this->prefix . $sessionId, []);
    }

    public function forget(string $sessionId, string $key): void
    {
        $memory = $this->recall($sessionId);
        unset($memory[$key]);
        Cache::put($this->prefix . $sessionId, $memory, now()->addHour());
    }

    public function toSystemPromptBlock(string $sessionId): string
    {
        $memory = $this->recall($sessionId);

        if (empty($memory)) {
            return '';
        }

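        $facts = collect($memory)
            // Values are mixed, so JSON-encode anything that is not already a string
            // (arrays would otherwise render as "Array" and booleans as "1" or "")
            ->map(fn ($v, $k) => "- {$k}: " . (is_string($v) ? $v : json_encode($v)))
            ->implode("\n");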

        return "Current session context:\n{$facts}";
    }
}

Always set an explicit TTL. Redis session keys without expiry are a slow leak. At scale, a support tool processing hundreds of concurrent sessions will accumulate hundreds of megabytes of stale working context if TTLs are omitted or set too long. Sixty minutes is a reasonable default for interactive sessions. Batch processing workflows may warrant shorter windows.
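An explicit TTL does not have to mean that an active session is cut off mid-task. One option is a sliding expiry, where every read pushes the deadline out again. The sketch below shows the idea with an in-memory stand-in so it runs without Redis. SlidingTtlStore and its injected clock are hypothetical; in the service above the equivalent would be re-issuing Cache::put() with a fresh TTL after a successful recall().

```php
<?php

declare(strict_types=1);

/**
 * In-memory stand-in for a TTL cache, used only to illustrate sliding expiry.
 * The class name and the explicit $now argument are assumptions made so the
 * behaviour can be exercised without Redis or a real clock.
 */
final class SlidingTtlStore
{
    /** @var array<string, array{value: array, expiresAt: int}> */
    private array $items = [];

    public function __construct(private int $ttlSeconds) {}

    public function put(string $key, array $value, int $now): void
    {
        $this->items[$key] = [
            'value'     => $value,
            'expiresAt' => $now + $this->ttlSeconds,
        ];
    }

    public function get(string $key, int $now): array
    {
        $item = $this->items[$key] ?? null;

        if ($item === null || $item['expiresAt'] <= $now) {
            unset($this->items[$key]); // Expired or missing: behave like a cache miss.

            return [];
        }

        // Sliding expiry: every successful read pushes the deadline out again.
        $this->items[$key]['expiresAt'] = $now + $this->ttlSeconds;

        return $item['value'];
    }
}
```

With a 60-second TTL, a key written at t=0 and read at t=50 stays alive until t=110, while a key nobody touches still disappears on schedule.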

Layer 3: Long-Term Memory with pgvector

Session memory expires. Long-term memory persists. The distinction matters: a user’s formatting preferences are worth keeping indefinitely; which ticket they were looking at last Tuesday is not.

The schema mirrors what the embeddings and vector database guide covers for RAG pipelines, applied here to agent-specific memory:

Schema::create('agent_memories', function (Blueprint $table) {
    $table->id();
    $table->foreignId('user_id')->constrained()->cascadeOnDelete();
    $table->text('content');
    $table->string('memory_type')->default('preference');
    $table->timestamps();
});

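// pgvector must be enabled in the database before the vector type is available
DB::statement('CREATE EXTENSION IF NOT EXISTS vector');
DB::statement('ALTER TABLE agent_memories ADD COLUMN embedding vector(1536)');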
DB::statement(
    'CREATE INDEX agent_memories_embedding_idx
     ON agent_memories USING hnsw (embedding vector_cosine_ops)'
);

The LongTermMemoryService generates an embedding for each stored memory and retrieves semantically relevant ones by cosine distance:

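use App\Models\AgentMemory; // assumes the default App\Models namespace
use Illuminate\Support\Facades\DB;

class LongTermMemoryService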
{
    public function __construct(
        private EmbeddingService $embeddings
    ) {}

    public function store(int $userId, string $content, string $type = 'preference'): void
    {
        $embedding = $this->embeddings->generate($content);

        $memory = AgentMemory::create([
            'user_id'     => $userId,
            'content'     => $content,
            'memory_type' => $type,
        ]);

        DB::statement(
            'UPDATE agent_memories SET embedding = ?::vector WHERE id = ?',
            ['[' . implode(',', $embedding) . ']', $memory->id]
        );
    }

    public function retrieve(int $userId, string $query, int $limit = 5): array
    {
        $queryEmbedding = $this->embeddings->generate($query);
        $vectorString   = '[' . implode(',', $queryEmbedding) . ']';

        return DB::select(
            "SELECT content, memory_type,
                    embedding <=> ?::vector AS distance
             FROM agent_memories
             WHERE user_id = ?
             ORDER BY distance
             LIMIT ?",
            [$vectorString, $userId, $limit]
        );
    }

    public function toSystemPromptBlock(int $userId, string $currentQuery): string
    {
        $memories = $this->retrieve($userId, $currentQuery);

        if (empty($memories)) {
            return '';
        }

        $facts = collect($memories)
            ->map(fn ($m) => "- {$m->content}")
            ->implode("\n");

        return "Relevant context about this user:\n{$facts}";
    }
}

The embedding generation in store() is synchronous here for clarity. In production, dispatch it as a queued job. Generating an embedding on the write path adds a network round trip to the embedding provider, roughly 100–200ms per memory store operation, which is acceptable in a background worker and unacceptable inline.
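For intuition about the ordering in retrieve(): pgvector's <=> operator returns cosine distance, which is 1 minus cosine similarity, so smaller values mean closer matches. The sketch below computes the same quantity in plain PHP and adds a post-filter, because a top-N query returns N rows even when none of them is actually relevant. Both function names and the 0.5 cut-off used in the example are assumptions for illustration; tune any threshold against your own data.

```php
<?php

declare(strict_types=1);

/**
 * Cosine distance as pgvector's <=> operator defines it: 1 - cosine similarity.
 * 0.0 means same direction, 1.0 orthogonal, 2.0 opposite.
 */
function cosineDistance(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and the same length.');
    }

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;

    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }

    if ($normA === 0.0 || $normB === 0.0) {
        throw new InvalidArgumentException('Cosine distance is undefined for a zero vector.');
    }

    return 1.0 - $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * Hypothetical post-filter for retrieve(): drop rows whose distance exceeds
 * a cut-off so weak matches never reach the system prompt.
 * $rows are objects with a numeric `distance` property, as DB::select() returns.
 */
function filterByDistance(array $rows, float $maxDistance): array
{
    return array_values(array_filter(
        $rows,
        fn (object $row) => (float) $row->distance <= $maxDistance
    ));
}
```

Applied to the service above, the filter would sit between retrieve() and the formatting step in toSystemPromptBlock(), for example filterByDistance($memories, 0.5).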

Wiring the Three Layers Together

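The MemoryAwareAgentService composes all three layers into a single request lifecycle. There is one ordering constraint worth understanding: Layer 3 and Layer 2 both inject into the system prompt, Layer 1 goes into withMessages(), and the current user message goes into withPrompt(). These are separate channels. Do not append the current user message to the Eloquent history before calling Prism, or it will be sent twice. Verify the combination against your installed Prism version: some releases reject a request that sets both withPrompt() and withMessages(), in which case append new UserMessage($userMessage) to $history and drop withPrompt().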

use Prism\Prism\Enums\Provider;
use Prism\Prism\Facades\Prism;

class MemoryAwareAgentService
{
    public function __construct(
        private ConversationMemoryService $conversation,
        private SessionMemoryService      $session,
        private LongTermMemoryService     $longTerm
    ) {}

    public function respond(
        int    $userId,
        string $sessionId,
        string $userMessage
    ): string {
        $longTermContext = $this->longTerm->toSystemPromptBlock($userId, $userMessage);
        $sessionContext  = $this->session->toSystemPromptBlock($sessionId);

        $systemPrompt = "You are a helpful production assistant."
            . ($longTermContext ? "\n\n{$longTermContext}" : '')
            . ($sessionContext  ? "\n\n{$sessionContext}"  : '');

        // Load prior turns only — current message enters via withPrompt()
        $history = $this->conversation->loadHistory($sessionId);

        try {
            $response = Prism::text()
                ->using(Provider::Anthropic, 'claude-sonnet-4-6')
                ->withSystemPrompt($systemPrompt)
                ->withMessages($history)
                ->withPrompt($userMessage)
                ->withMaxTokens(2048)
                ->withClientRetry(3, 100)
                ->asText();
        } catch (\Throwable $e) {
            throw new \RuntimeException(
                'Agent response failed: ' . $e->getMessage(),
                0,
                $e
            );
        }

        // Persist both turns after a successful response
        $this->conversation->append($sessionId, 'user', $userMessage, $userId);
        $this->conversation->append(
            $sessionId,
            'assistant',
            $response->text,
            $userId,
            $response->usage->completionTokens
        );

        return $response->text;
    }
}

withClientRetry(3, 100) uses Prism’s built-in retry via Laravel’s HTTP client: up to three attempts with a fixed 100ms pause between them. For rate limit detection beyond that, inspect the HTTP status code in the exception and back off accordingly. The production architecture governance guide covers retry strategies with circuit breakers in more depth.
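Backing off by status code can be reduced to a small pure function. The sketch below is one way to do it, not Prism's behaviour: retryDelayMs() is a hypothetical helper, and the 500ms base and 30-second cap are illustrative values. How you obtain the status code and any Retry-After value depends on the exception your provider integration throws, so that part is left out.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: decide how long to wait before retrying a failed
 * provider call. Returns null when the status code is not worth retrying.
 *
 * $attempt is 1-based. $retryAfterSeconds is the parsed Retry-After header,
 * if the provider sent one.
 */
function retryDelayMs(int $status, int $attempt, ?int $retryAfterSeconds = null): ?int
{
    // 429 = rate limited, 5xx = provider-side failure.
    // Other 4xx responses will not succeed on a retry.
    $retryable = $status === 429 || ($status >= 500 && $status <= 599);

    if (! $retryable) {
        return null;
    }

    // Honour an explicit Retry-After value when one is available.
    if ($retryAfterSeconds !== null) {
        return $retryAfterSeconds * 1000;
    }

    // Otherwise exponential backoff: 500ms, 1s, 2s, ... capped at 30s.
    $baseMs  = 500;
    $capMs   = 30_000;
    $attempt = max(1, $attempt);

    return (int) min($capMs, $baseMs * (2 ** ($attempt - 1)));
}
```

For example, retryDelayMs(429, 3) yields 2000, while retryDelayMs(400, 1) yields null and the caller should surface the error instead of retrying.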

Persist both turns only after a successful response. If the Prism call throws, neither turn is written. This keeps the Eloquent history consistent with what the model actually processed.

Memory Extraction: Teaching the Agent to Remember

The three services above handle reading memory. They do not handle writing it automatically. An agent that cannot identify and store new persistent facts is only half a memory system.

Two approaches exist:

| Approach | How it triggers | Model cost | Latency impact | Best for |
| --- | --- | --- | --- | --- |
| Extraction prompt | Secondary completion after each turn | Low (haiku / mini) | Negligible if queued | Simple preference tracking |
| Tool-based self-storage | Agent invokes a `remember_fact` tool autonomously | None (uses main call) | None | Complex agentic workflows |

The extraction prompt runs a secondary low-cost completion against the assistant’s response, asking the model to identify any facts worth persisting. Dispatch it as a queued job so it does not block the primary response:

// In a queued job
$extraction = Prism::text()
    ->using(Provider::Anthropic, 'claude-haiku-4-5-20251001')
    ->withSystemPrompt(
        'Extract any persistent facts about the user from this conversation turn. '
        . 'Return a JSON array of strings. Return an empty array if none.'
    )
    ->withPrompt("User: {$userMessage}\nAssistant: {$assistantResponse}")
    ->withMaxTokens(256)
    ->asText();

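// Models sometimes return invalid JSON or the wrong shape, so validate before storing
$decoded = json_decode($extraction->text, true);

$facts = is_array($decoded)
    ? array_filter($decoded, fn ($fact) => is_string($fact) && trim($fact) !== '')
    : [];

foreach ($facts as $fact) {
    $this->longTerm->store($userId, $fact);
}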

[Efficiency Gain] Use claude-haiku-4-5-20251001 or gpt-4o-mini for extraction. The cost is negligible and the latency impact disappears when dispatched via a queued job. Running extraction on the primary model is unnecessary: the task is classification, not generation.

[Architect’s Note] The tool-based approach gives the agent autonomy over what it remembers. Cleaner architecture, but it requires the model to exercise consistent judgment about what is worth persisting. In practice, models tend to over-store: every session detail becomes a candidate for long-term memory. Pair tool-based self-storage with a confidence threshold and a review queue before committing. The schema validation patterns for agentic workflows describe how to validate structured model output before it reaches persistent storage.
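The confidence threshold and review queue can be expressed as a small triage step between the tool call and the store. The sketch below assumes the remember_fact tool schema asks the model for a fact string and a self-reported confidence between 0 and 1. That payload shape, the function name, and the 0.8 and 0.5 thresholds are all illustrative assumptions rather than anything defined earlier in this article.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical triage for a remember_fact tool call.
 *
 * Assumes a payload of ['fact' => string, 'confidence' => float 0..1].
 * Returns 'store', 'review', or 'discard'.
 */
function triageMemoryCandidate(
    array $candidate,
    float $storeAt = 0.8,
    float $reviewAt = 0.5
): string {
    $fact       = $candidate['fact'] ?? null;
    $confidence = $candidate['confidence'] ?? null;

    // Reject malformed payloads before they get anywhere near persistent storage.
    if (! is_string($fact) || trim($fact) === '' || ! is_numeric($confidence)) {
        return 'discard';
    }

    $confidence = (float) $confidence;

    if ($confidence < 0.0 || $confidence > 1.0) {
        return 'discard';
    }

    return match (true) {
        $confidence >= $storeAt  => 'store',   // Commit to long-term memory.
        $confidence >= $reviewAt => 'review',  // Park in a review queue.
        default                  => 'discard', // Too uncertain to keep.
    };
}
```

Self-reported confidence is a weak signal on its own, so treat the thresholds as a starting point and watch what lands in the review queue before trusting the 'store' path.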

Memory Maintenance: TTLs, Limits, and Deletion

Memory systems accumulate noise. Production deployments discover this around the three-month mark, when retrieval quality degrades and storage costs become visible.

Three maintenance concerns apply:

Conversation history pruning. Archive sessions older than 90 days to cold storage. Delete the associated rows from agent_conversations. If you need a historical record, write a compressed summary to a separate archive table first. Do not keep unbounded raw history in your primary database.

Session memory TTLs. Already covered above, but worth repeating: never let Redis session keys persist indefinitely. If your application resumes sessions across multiple days, extend the TTL on each access rather than setting an indefinite expiry.

Long-term memory deduplication. The most insidious hygiene problem. If a user interacts with the agent daily for a month, you may have thirty near-identical entries for “User prefers concise responses.” Every retrieval returns noise. A scheduled job identifying and merging memories with cosine distance below a threshold (0.05 is a reasonable starting point) keeps the long-term store clean.

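A scheduled Artisan command handles the two database-backed concerns (Redis expires session keys on its own through their TTLs):

// routes/console.php
use Illuminate\Support\Facades\Schedule;

Schedule::command('agent:memory-hygiene')->dailyAt('03:00');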

// app/Console/Commands/AgentMemoryHygiene.php
public function handle(): void
{
    // Archive and prune old sessions
    AgentConversation::where('created_at', '<', now()->subDays(90))
        ->chunkById(500, function ($messages) {
            // Write summary to archive, then delete
            $messages->each->delete();
        });

    // Deduplicate near-identical long-term memories per user
    // Implementation varies by pgvector version — use a self-join
    // on cosine distance below your threshold
}

[Edge Case Alert] Semantically similar memories do not need to be textually identical to cause retrieval pollution. “User prefers brief answers,” “Keep responses concise,” and “User dislikes long explanations” are three distinct strings with embeddings that cluster tightly. All three will surface in most retrievals. Deduplication based on cosine distance catches these where exact string matching cannot.
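The deduplication pass can be prototyped in plain PHP before committing to a SQL self-join. The sketch below is a greedy, quadratic pass over one user's memories: the first memory in each cluster survives and later near-identical ones are mapped to it. The function names and the input shape are assumptions for illustration. In production you would load the embeddings per user from agent_memories, and for large stores push the comparison into pgvector rather than PHP.

```php
<?php

declare(strict_types=1);

/** Cosine distance (1 - cosine similarity) between two equal-length vectors. */
function embeddingDistance(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and the same length.');
    }

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;

    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }

    if ($normA === 0.0 || $normB === 0.0) {
        throw new InvalidArgumentException('Cosine distance is undefined for a zero vector.');
    }

    return 1.0 - $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * Hypothetical greedy near-duplicate detection for one user's memories.
 *
 * $memories is a list of ['id' => int, 'embedding' => float[]], oldest first.
 * Returns a map of duplicateId => survivorId; survivors do not appear as keys.
 */
function findNearDuplicates(array $memories, float $threshold = 0.05): array
{
    $survivors  = [];
    $duplicates = [];

    foreach ($memories as $memory) {
        foreach ($survivors as $survivor) {
            if (embeddingDistance($memory['embedding'], $survivor['embedding']) < $threshold) {
                $duplicates[$memory['id']] = $survivor['id'];
                continue 2; // Already covered by an earlier memory.
            }
        }

        $survivors[] = $memory;
    }

    return $duplicates;
}
```

The returned map tells the hygiene command which rows to delete (the keys) and which row each one collapses into (the values), which is also where a merged timestamp or usage counter could be carried over.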

A three-layer architecture separates three genuinely different problems: what the model is currently processing (Layer 1), what the agent has learned about this session (Layer 2), and what it knows about this user across all sessions (Layer 3). Keeping them separate keeps each layer maintainable. Conflating them (loading all three into one undifferentiated system prompt block) produces a brittle, expensive context that grows without bound.

The implementation above gives you the foundation. Production tuning comes through observation: watch which memory types surface in retrieval, measure how often the extraction prompt identifies genuinely useful facts, and tune the deduplication threshold against your actual data distribution.

Frequently Asked Questions

How many conversation turns should I keep in the sliding window?

Twenty turns covers most interactive sessions without approaching context limits on a 200k-token model. The right number depends on average message length. If your users send long structured messages, halve it. Instrument token_count per session and alert when cumulative history approaches 30% of the model’s context window.

Can I use a different embedding model for long-term memory?

Yes. The 1536-dimension column above targets OpenAI’s text-embedding-3-small. If you switch to a different embedding model with a different output dimension, you will need to re-embed all existing memories. Plan for that before going to production. The embedding models covered in the Laravel embeddings implementation guide include dimension specs for the main options.

What happens if the pgvector retrieval is slow?

The HNSW index handles most query patterns well at moderate scale. If retrieval latency becomes a concern, check that the index was built with appropriate m and ef_construction parameters for your dataset size, and review the query-time hnsw.ef_search setting, which trades recall against speed. For very large memory stores (millions of records), consider partitioning agent_memories by user_id range.

Should session memory survive a page refresh?

That depends on whether you treat the session as browser-session-scoped or user-session-scoped. If you key Redis on a UUID you persist in a cookie or local storage, it survives the refresh. If you generate a new session ID on each page load, it does not. Make this an explicit decision, not an accident.

How do I handle memory for unauthenticated users?

Use a session fingerprint (hashed IP + user agent + timestamp bucket) in place of the user_id as the identifier for Layer 2, with a shorter TTL (15–30 minutes). Skip Layer 3 entirely for unauthenticated users: there is no durable identity to anchor long-term memory to.
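A fingerprint of that kind can be derived with a keyed hash. The sketch below is one possible shape: anonymousSessionKey() is a hypothetical helper, $secret stands for an application-level secret you supply, and the 30-minute bucket is an example value.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical anonymous session key: keyed hash of IP, user agent,
 * and a coarse time bucket.
 *
 * $secret is an application-level secret (an assumption) so that the
 * fingerprint cannot be recomputed by anyone who knows the inputs.
 */
function anonymousSessionKey(
    string $ip,
    string $userAgent,
    int $timestamp,
    string $secret,
    int $bucketSeconds = 1800
): string {
    // All requests inside the same bucket share one fingerprint.
    $bucket = intdiv($timestamp, $bucketSeconds);

    return hash_hmac('sha256', implode('|', [$ip, $userAgent, $bucket]), $secret);
}
```

Two caveats follow from the design. The key changes when the bucket rolls over, so a conversation that straddles a boundary loses its working context. And visitors behind the same NAT with the same browser build produce the same key, so keep anything stored under it free of sensitive detail.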

Dewald Hugo

A software architect with 15+ years of experience in the PHP and Laravel ecosystem. Dewald created Origin Main to provide the engineering rigour required to integrate AI into professional, high-concurrency production systems. He writes for developers who care less about "getting it to work" and more about "getting it to last".
