If your Laravel application uses a language model for anything beyond one-shot completions — document Q&A, knowledge bases, contextual search, agentic memory — you will eventually need to solve the same problem: LLMs do not have persistent memory, and context windows are finite and expensive. Embeddings and vector databases are the infrastructure layer that fixes this.
This guide covers Laravel vector database integration from first principles through to a production-grade RAG pipeline. We will build a real implementation: pgvector on PostgreSQL, an injectable EmbeddingService, queue-backed document ingestion, and a similarity search query you can drop into any Laravel 11 or 12 application. No Python. No generic boilerplate.
What Are Embeddings, Specifically?
An embedding is a fixed-length numerical vector that represents the semantic meaning of a piece of text. Two sentences with similar meanings will produce vectors that are mathematically close together, even if they share zero words.
"How do I reset my password?" → vector A
"Forgot password recovery process" → vector B
cosine_similarity(A, B) ≈ 0.94
"How do I reset my password?" → vector A
"What is the capital of France?" → vector C
cosine_similarity(A, C) ≈ 0.12
This is what makes semantic search fundamentally different from LIKE '%password%'. You are measuring conceptual distance, not string overlap. For Laravel applications serving knowledge-intensive features — support bots, document search, policy lookup — this distinction is the difference between a system that feels intelligent and one that feels brittle.
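To make the distance measure concrete, here is a minimal sketch of cosine similarity in plain PHP. The function name and the three-dimensional toy vectors are illustrative assumptions; real embeddings from text-embedding-3-small have 1536 dimensions, and in production the database computes this for you.

```php
<?php

declare(strict_types=1);

/**
 * Cosine similarity between two equal-length vectors.
 * Returns a value in [-1, 1]; higher means closer in meaning.
 *
 * @param array<int, float|int> $a
 * @param array<int, float|int> $b
 */
function cosineSimilarity(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length.');
    }

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;

    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }

    if ($normA === 0.0 || $normB === 0.0) {
        throw new InvalidArgumentException('Cosine similarity is undefined for a zero vector.');
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

// Toy vectors standing in for real embeddings (hypothetical values).
$reset   = [0.9, 0.1, 0.0]; // "How do I reset my password?"
$forgot  = [0.8, 0.2, 0.1]; // "Forgot password recovery process"
$capital = [0.0, 0.1, 0.9]; // "What is the capital of France?"

printf("reset vs forgot:  %.2f\n", cosineSimilarity($reset, $forgot));
printf("reset vs capital: %.2f\n", cosineSimilarity($reset, $capital));
```

The two password vectors point in nearly the same direction and score close to 1, while the unrelated vector scores close to 0.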
Embedding models are trained differently from generation models. Their training objective is not next-token prediction; it is to pull semantically related inputs close together in vector space and push unrelated inputs apart. OpenAI’s text-embedding-3-small and text-embedding-3-large are the current production-grade options. Provider selection matters here more than most developers expect, because not every generation provider ships a native embedding model. Our guide to Laravel AI provider selection covers embedding capability as part of the broader provider comparison, including which providers require you to mix-and-match for generation versus retrieval. The small model (1536 dimensions) is the right default for most Laravel applications. At list prices it costs roughly 6.5x less than text-embedding-3-large ($0.02 versus $0.13 per million tokens), with acceptable quality for retrieval tasks.
Why Traditional Databases Break Down Here
You might be tempted to store embedding vectors in a standard MySQL or PostgreSQL column and run similarity queries with a bit of math. You can. It will work at 500 rows. At 50,000 rows it will be unusably slow.
The issue is indexing. Relational databases are built around B-tree and hash indexes, which are optimised for exact matches and range scans. Finding the nearest vector in high-dimensional space is a fundamentally different problem — it requires approximate nearest neighbour (ANN) algorithms and specialised index structures like HNSW (Hierarchical Navigable Small World) or IVFFlat.
For Laravel applications already running PostgreSQL, the answer is pgvector. It is a PostgreSQL extension that adds a native vector column type, ANN indexes, and distance operators directly into your existing database. No new infrastructure. No separate service to manage. No Pinecone subscription. For most Laravel shops this is the correct architectural choice, especially at the scale of a typical SaaS product.
Setting Up pgvector with Laravel 11/12
Enable the Extension
On a fresh PostgreSQL instance:
CREATE EXTENSION IF NOT EXISTS vector;
If you are using Laravel Forge or a managed database (e.g., Supabase, RDS), pgvector is typically available as an extension you enable from the dashboard. On a self-managed server:
# Ubuntu/Debian
sudo apt install postgresql-16-pgvector
# Or build from source
cd /tmp && git clone https://github.com/pgvector/pgvector.git
cd pgvector && make && sudo make install
Then in your .env, ensure DB_CONNECTION=pgsql.
The Migration
Laravel’s Blueprint does not have a native vector column type. We reach for DB::statement to add it cleanly after the table is created.
<?php
use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Schema;
return new class extends Migration
{
public function up(): void
{
DB::statement('CREATE EXTENSION IF NOT EXISTS vector');
Schema::create('document_chunks', function (Blueprint $table) {
$table->id();
$table->foreignId('document_id')->constrained()->cascadeOnDelete();
$table->text('content');
$table->jsonb('metadata')->nullable(); // source, page, section, etc.
$table->unsignedInteger('chunk_index')->default(0);
$table->timestamps();
});
// text-embedding-3-small outputs 1536 dimensions
DB::statement('ALTER TABLE document_chunks ADD COLUMN embedding vector(1536)');
// IVFFlat index for approximate nearest neighbour search
// Starting point from the pgvector docs: lists = rows / 1000 up to ~1M rows, sqrt(rows) beyond that
DB::statement(
'CREATE INDEX document_chunks_embedding_idx
ON document_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100)'
);
}
public function down(): void
{
Schema::dropIfExists('document_chunks');
}
};
[Architect’s Note] The IVFFlat index requires the table to contain data before it can be trained effectively. If you are bulk-importing documents, insert all rows first and then create the index in a separate migration or artisan command. Creating the index on an empty table and then inserting data defeats the purpose, because the list centroids are computed from whatever rows exist at build time and will be poorly distributed. For tables exceeding ~1M rows, evaluate the HNSW index instead; it offers better recall at the cost of higher build time and memory, and because it has no training step it can be created before any data is loaded.
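As a rough aid for choosing IVFFlat parameters, the sketch below encodes the starting points suggested in the pgvector README: rows / 1000 lists for up to about one million rows, sqrt(rows) beyond that, and sqrt(lists) probes at query time. The function names are hypothetical, and the numbers are starting points to tune against measured recall, not guarantees.

```php
<?php

declare(strict_types=1);

/**
 * Suggested starting value for the IVFFlat `lists` option.
 * Heuristic from the pgvector README; tune against measured recall.
 */
function suggestedLists(int $rowCount): int
{
    if ($rowCount <= 1_000_000) {
        return max(1, intdiv($rowCount, 1000));
    }

    return (int) round(sqrt($rowCount));
}

/**
 * Suggested starting value for the `ivfflat.probes` setting at query time.
 */
function suggestedProbes(int $lists): int
{
    return max(1, (int) round(sqrt($lists)));
}

$lists = suggestedLists(50_000);
printf("lists=%d probes=%d\n", $lists, suggestedProbes($lists));
```

The probes value is applied per session or transaction with SET ivfflat.probes = N before running the search; higher values trade speed for recall.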
The Eloquent Model
<?php
namespace App\Models;
use Illuminate\Database\Eloquent\Model;
use Illuminate\Database\Eloquent\Relations\BelongsTo;
class DocumentChunk extends Model
{
protected $fillable = [
'document_id',
'content',
'metadata',
'chunk_index',
'embedding',
];
protected $casts = [
'metadata' => 'array',
];
public function document(): BelongsTo
{
return $this->belongsTo(Document::class);
}
}
The embedding column is stored and retrieved as a string from PostgreSQL. We will handle serialisation in the service layer.
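For the read side, a minimal sketch of turning pgvector's text representation back into a PHP float array might look like the following. The helper name parseVector is an assumption for illustration; it is not part of Laravel or pgvector.

```php
<?php

declare(strict_types=1);

/**
 * Convert pgvector's text form, e.g. "[0.1,0.2,0.3]", into a list of floats.
 *
 * @return float[]
 */
function parseVector(string $literal): array
{
    $literal = trim($literal);

    if (!str_starts_with($literal, '[') || !str_ends_with($literal, ']')) {
        throw new InvalidArgumentException('Expected a pgvector literal such as "[0.1,0.2]".');
    }

    // Drop the surrounding brackets
    $inner = trim(substr($literal, 1, -1));

    if ($inner === '') {
        return [];
    }

    return array_map(
        static fn (string $component): float => (float) trim($component),
        explode(',', $inner)
    );
}

$embedding = parseVector('[0.12,-0.5,1e-05]');
printf("%d dimensions\n", count($embedding));
```

A helper like this could be called from an Eloquent accessor on the model if you need the raw vector in PHP, though for search you rarely do: the distance calculation happens inside PostgreSQL.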
The EmbeddingService
This is where the Laravel Service Container earns its keep. We define a dedicated service, bind it as a singleton, and inject it wherever we need embeddings — Jobs, Controllers, Artisan commands.
<?php
namespace App\Services\AI;
use Illuminate\Support\Facades\Http;
use Illuminate\Support\Facades\Log;
use Illuminate\Http\Client\RequestException;
use RuntimeException;
class EmbeddingService
{
private const MODEL = 'text-embedding-3-small';
private const MAX_RETRIES = 3;
public function __construct(
private readonly string $apiKey = '',
) {}
/**
* Embed a single string. Returns a float[].
*/
public function embed(string $text): array
{
$text = $this->sanitise($text);
return $this->callWithRetry($text);
}
/**
* Embed a batch of strings (up to 2048 inputs per OpenAI request).
* Returns an array of float[], indexed to match the input array.
*/
public function embedBatch(array $texts): array
{
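// Sanitise and re-index so positions run 0..n-1
$texts = array_values(array_map(fn(string $t) => $this->sanitise($t), $texts));
$batchSize = 100;
$results = [];
foreach (array_chunk($texts, $batchSize) as $batchNumber => $batch) {
$response = $this->request($batch);
// The API's `index` restarts at 0 for every request, so offset it by the
// batch position; otherwise later batches overwrite earlier ones
$offset = $batchNumber * $batchSize;
foreach ($response['data'] as $item) {
$results[$offset + $item['index']] = $item['embedding'];
}
}
ksort($results);
return array_values($results);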
}
private function callWithRetry(string $text): array
{
$attempts = 0;
while ($attempts < self::MAX_RETRIES) {
try {
$response = $this->request($text);
return $response['data'][0]['embedding'];
} catch (RequestException $e) {
$status = $e->response->status();
if ($status === 429) {
// Respect Retry-After header when present
$retryAfter = (int) ($e->response->header('Retry-After') ?: 2 ** $attempts);
Log::warning('OpenAI rate limit hit on embeddings', [
'attempt' => $attempts + 1,
'retry_after' => $retryAfter,
]);
sleep($retryAfter);
$attempts++;
continue;
}
// Non-retryable errors: 400, 401, 403, etc.
Log::error('OpenAI embedding request failed', [
'status' => $status,
'body' => $e->response->body(),
]);
throw new RuntimeException(
"Embedding request failed with HTTP {$status}",
previous: $e
);
}
}
throw new RuntimeException('Embedding request exceeded max retries after rate limiting.');
}
private function request(string|array $input): array
{
$response = Http::withToken($this->apiKey)
->timeout(30)
->post('https://api.openai.com/v1/embeddings', [
'model' => self::MODEL,
'input' => $input,
])
->throw();
return $response->json();
}
private function sanitise(string $text): string
{
// Strip null bytes, normalise whitespace
$text = str_replace("\0", '', $text);
return trim(preg_replace('/\s+/', ' ', $text));
}
}
Register it as a singleton in the container. Laravel 11 and 12 moved most bootstrapping into bootstrap/app.php, but the default skeleton still ships app/Providers/AppServiceProvider.php, and its register() method remains the clearest home for a binding like this. bootstrap/app.php itself needs no changes:
// bootstrap/app.php (unchanged, shown for orientation)
return Application::configure(basePath: dirname(__DIR__))
->withRouting(...)
->withMiddleware(...)
->withExceptions(...)
->create();
// app/Providers/AppServiceProvider.php
use App\Services\AI\EmbeddingService;
public function register(): void
{
$this->app->singleton(EmbeddingService::class, function () {
return new EmbeddingService(
apiKey: config('services.openai.key'),
);
});
}
Chunking Strategy: The Part That Actually Determines Quality
Before you can embed a document, you need to split it into chunks. This is not a trivial decision. Chunking strategy is arguably the single largest determinant of RAG quality — and the most commonly underestimated.
<?php
namespace App\Services\AI;
class DocumentChunker
{
public function __construct(
private readonly int $maxTokens = 400,
private readonly int $overlapTokens = 50,
) {}
/**
* Chunk by paragraph with overlap.
* Overlap preserves context at boundaries — critical for coherent retrieval.
*/
public function chunk(string $text): array
{
$paragraphs = preg_split('/\n{2,}/', $text, flags: PREG_SPLIT_NO_EMPTY);
$chunks = [];
$current = '';
$currentTokens = 0;
$overlap = '';
foreach ($paragraphs as $paragraph) {
$paragraphTokens = $this->estimateTokens($paragraph);
if ($currentTokens + $paragraphTokens > $this->maxTokens && $current !== '') {
$chunks[] = trim($overlap . ' ' . $current);
$overlap = $this->extractOverlap($current);
$current = $paragraph;
$currentTokens = $paragraphTokens;
} else {
$current .= "\n\n" . $paragraph;
$currentTokens += $paragraphTokens;
}
}
if (trim($current) !== '') {
$chunks[] = trim($overlap . ' ' . $current);
}
return array_values(array_filter($chunks, fn($c) => strlen($c) > 50));
}
/**
* Approximate token count: ~4 characters per token is a rough heuristic for English text.
*/
private function estimateTokens(string $text): int
{
return (int) ceil(mb_strlen($text) / 4);
}
private function extractOverlap(string $text): string
{
$words = explode(' ', $text);
$overlapWordCount = (int) ceil($this->overlapTokens * 0.75);
return implode(' ', array_slice($words, -$overlapWordCount));
}
}
[Production Pitfall] Chunks that are too large reduce retrieval precision — you get back big blobs of loosely related content, and the LLM gets noisy context. Chunks that are too small lose coherence at the boundaries, breaking sentences and stripping context. A 300–500 token window with 50-token overlap is a safe starting point, but the right number is always application-specific and should be evaluated against real queries. Log your retrieval results from day one. You cannot tune what you do not measure.
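One gap worth noting in paragraph-based chunking is a single paragraph that is larger than the token budget on its own. A possible fallback, sketched here under the same 4-characters-per-token assumption, is to split such a paragraph on sentence boundaries. The function names are hypothetical and the sentence regex is deliberately naive (it will mis-split abbreviations such as "e.g.").

```php
<?php

declare(strict_types=1);

/** Rough token estimate: ~4 characters per token for English text. */
function approxTokens(string $text): int
{
    return (int) ceil(mb_strlen($text) / 4);
}

/**
 * Split one oversized paragraph into pieces that each stay within $maxTokens,
 * breaking only at sentence boundaries. A single sentence longer than the
 * budget is kept whole rather than cut mid-sentence.
 *
 * @return string[]
 */
function splitBySentence(string $paragraph, int $maxTokens = 400): array
{
    $sentences = preg_split('/(?<=[.!?])\s+/u', trim($paragraph), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $pieces = [];
    $current = '';

    foreach ($sentences as $sentence) {
        $candidate = $current === '' ? $sentence : $current . ' ' . $sentence;

        if ($current !== '' && approxTokens($candidate) > $maxTokens) {
            // Adding this sentence would exceed the budget: close the current piece
            $pieces[] = $current;
            $current = $sentence;
        } else {
            $current = $candidate;
        }
    }

    if ($current !== '') {
        $pieces[] = $current;
    }

    return $pieces;
}

$pieces = splitBySentence('One two three. Four five six. Seven eight nine.', 8);
print_r($pieces);
```

In a chunker like the one above, this would run on any paragraph whose estimated token count exceeds the maximum before the paragraph enters the accumulation loop.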
Queue-Based Document Ingestion
Never embed documents synchronously in a controller. Embedding API calls take 200–800ms per request, and batch jobs for large documents will time out your HTTP request. The correct pattern is a dispatchable Job.
<?php
namespace App\Jobs;
use App\Models\Document;
use App\Models\DocumentChunk;
use App\Services\AI\DocumentChunker;
use App\Services\AI\EmbeddingService;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;
use Throwable;
class EmbedDocumentJob implements ShouldQueue
{
use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;
public int $tries = 3;
public int $backoff = 30; // seconds between retries
public function __construct(
private readonly Document $document,
) {}
public function handle(EmbeddingService $embedding, DocumentChunker $chunker): void
{
$chunks = $chunker->chunk($this->document->content);
if (empty($chunks)) {
Log::warning('Document produced zero chunks', ['document_id' => $this->document->id]);
return;
}
// Embed in batches to reduce API round trips
$vectors = $embedding->embedBatch($chunks);
DB::transaction(function () use ($chunks, $vectors) {
// Delete existing chunks before re-embedding
$this->document->chunks()->delete();
$rows = array_map(function (string $chunk, int $index) use ($vectors) {
return [
'document_id' => $this->document->id,
'content' => $chunk,
'chunk_index' => $index,
'metadata' => json_encode([
'source' => $this->document->source ?? 'unknown',
'title' => $this->document->title,
]),
'embedding' => '[' . implode(',', $vectors[$index]) . ']',
'created_at' => now(),
'updated_at' => now(),
];
}, $chunks, array_keys($chunks));
// Chunk inserts to avoid hitting PostgreSQL parameter limits
foreach (array_chunk($rows, 50) as $batch) {
DB::table('document_chunks')->insert($batch);
}
});
Log::info('Document embedded successfully', [
'document_id' => $this->document->id,
'chunk_count' => count($chunks),
]);
}
public function failed(Throwable $exception): void
{
Log::error('EmbedDocumentJob failed', [
'document_id' => $this->document->id,
'error' => $exception->getMessage(),
]);
}
}
Dispatch it from a controller or observer:
EmbedDocumentJob::dispatch($document)->onQueue('embeddings');
Run a dedicated worker for the embeddings queue (for example, php artisan queue:work --queue=embeddings), configured as its own worker on Forge/Vapor, so you can control concurrency independently of your main queue.
The Similarity Search Query
With documents embedded and stored, similarity search is a single query-builder call.
<?php
namespace App\Services\AI;
use App\Models\DocumentChunk;
use Illuminate\Support\Collection;
use Illuminate\Support\Facades\DB;
class VectorSearchService
{
public function __construct(
private readonly EmbeddingService $embedding,
) {}
/**
* Return the top-k most semantically similar chunks for a given query.
*/
public function search(string $query, int $limit = 5, float $threshold = 0.7): Collection
{
$queryVector = $this->embedding->embed($query);
$vectorLiteral = '[' . implode(',', $queryVector) . ']';
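return DB::table('document_chunks')
->select([
'id',
'document_id',
'content',
'metadata',
'chunk_index',
])
// Bind the vector as a parameter instead of interpolating it into the SQL
->selectRaw('1 - (embedding <=> ?::vector) AS similarity', [$vectorLiteral])
->whereRaw('1 - (embedding <=> ?::vector) >= ?', [$vectorLiteral, $threshold])
// Order by the raw distance, ascending: pgvector only uses its ANN index
// when ORDER BY is the distance operator itself, not an expression around it
->orderByRaw('embedding <=> ?::vector', [$vectorLiteral])
->limit($limit)
->get();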
}
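}
The <=> operator is pgvector’s cosine distance operator. Subtracting from 1 converts distance to similarity. Treat the 0.7 threshold as a starting point rather than a universal default: similarity scores are model-specific, so calibrate it against real queries and the chunks they should return. Set it too high and relevant chunks are silently dropped; too low and you start retrieving chunks that are semantically irrelevant.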
[Edge Case Alert] PostgreSQL will parameterise most query values safely, but vector literals passed through DB::raw need careful handling. The vector here is constructed from a float array you control, so never interpolate user input into the vector literal directly. Always embed the user’s query server-side first, then use the resulting float array, and prefer a bound ?::vector parameter over string interpolation where you can.
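One way to enforce that rule is to build the literal through a small guard that rejects anything other than finite numbers. This is a sketch; the helper name toVectorLiteral and the default dimension check are assumptions, not part of pgvector or Laravel.

```php
<?php

declare(strict_types=1);

/**
 * Build a pgvector literal from a numeric array, refusing anything that is
 * not a finite int or float so a string can never reach the SQL.
 *
 * @param array<int, mixed> $vector
 */
function toVectorLiteral(array $vector, int $expectedDimensions = 1536): string
{
    if (count($vector) !== $expectedDimensions) {
        throw new InvalidArgumentException(
            sprintf('Expected %d dimensions, got %d.', $expectedDimensions, count($vector))
        );
    }

    $parts = [];

    foreach ($vector as $i => $value) {
        if (!is_int($value) && !is_float($value)) {
            throw new InvalidArgumentException("Component {$i} is not a number.");
        }

        if (!is_finite((float) $value)) {
            throw new InvalidArgumentException("Component {$i} is not finite.");
        }

        $parts[] = (string) (float) $value;
    }

    return '[' . implode(',', $parts) . ']';
}

echo toVectorLiteral([0.5, -1, 2.25], 3), "\n";
```

The dimension check also catches a subtler failure: switching embedding models without updating the vector(1536) column.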
Building the RAG Pipeline
The search service is the retrieval half. Now let us wire it into a full RAG pipeline that passes retrieved context to the language model.
<?php
namespace App\Services\AI;
use Illuminate\Support\Facades\Http;
use Illuminate\Support\Facades\Log;
class RagPipelineService
{
public function __construct(
private readonly VectorSearchService $search,
private readonly string $openAiKey = '',
) {}
public function answer(string $userQuery, int $contextChunks = 4): string
{
// 1. Retrieve semantically relevant chunks
$chunks = $this->search->search($userQuery, limit: $contextChunks);
if ($chunks->isEmpty()) {
return "I could not find relevant information to answer that question.";
}
// 2. Build grounded context block
$context = $chunks->map(fn($chunk) => $chunk->content)->implode("\n\n---\n\n");
// 3. Construct the prompt
$systemPrompt = <<<PROMPT
You are a helpful assistant. Answer the user's question using ONLY the context provided below.
If the answer is not contained in the context, say so clearly. Do not fabricate information.
Context:
{$context}
PROMPT;
// 4. Call the generation model
try {
$response = Http::withToken($this->openAiKey)
->timeout(60)
->post('https://api.openai.com/v1/chat/completions', [
'model' => 'gpt-4o-mini',
'messages' => [
['role' => 'system', 'content' => $systemPrompt],
['role' => 'user', 'content' => $userQuery],
],
'max_tokens' => 800,
'temperature' => 0.2, // low temperature for grounded, factual responses
])
->throw()
->json();
return $response['choices'][0]['message']['content'] ?? 'No response generated.';
} catch (\Throwable $e) {
Log::error('RAG generation failed', ['error' => $e->getMessage()]);
throw $e;
}
}
}
A few important choices in that implementation worth highlighting. The temperature is set to 0.2. For RAG, you want the model to synthesise from context, not freestyle. Higher temperatures introduce hallucination risk in exactly the scenario you built RAG to prevent. The max_tokens cap at 800 keeps costs predictable; adjust based on your expected response length.
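The context-building step is also where you can cap prompt size and label sources for attribution. The sketch below is one possible shape, assuming chunks arrive best-first as arrays with a content key and an optional title key (the pipeline above receives objects from the query builder, so adapt the access accordingly). The function names and the 2000-token budget are hypothetical.

```php
<?php

declare(strict_types=1);

/** Rough token estimate: ~4 characters per token for English text. */
function roughTokenCount(string $text): int
{
    return (int) ceil(mb_strlen($text) / 4);
}

/**
 * Join chunks into a labelled context block, stopping before the token budget is exceeded.
 *
 * @param array<int, array{content: string, title?: string}> $chunks Ordered best match first
 */
function buildContext(array $chunks, int $maxTokens = 2000): string
{
    $blocks = [];
    $used = 0;

    foreach (array_values($chunks) as $i => $chunk) {
        $block = sprintf(
            "[Source %d: %s]\n%s",
            $i + 1,
            $chunk['title'] ?? 'untitled',
            trim($chunk['content'])
        );

        $cost = roughTokenCount($block);

        if ($used + $cost > $maxTokens) {
            break; // lower-ranked chunks are dropped first
        }

        $blocks[] = $block;
        $used += $cost;
    }

    return implode("\n\n---\n\n", $blocks);
}

$chunks = [
    ['title' => 'Refund policy', 'content' => 'Refunds are issued within 14 days.'],
    ['title' => 'Shipping', 'content' => 'Shipping takes three to five business days.'],
];

echo buildContext($chunks), "\n";
```

Numbered source labels give the model something concrete to cite, which makes it easier to surface attribution to the user.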
If you are already using Prism PHP in your application, the library’s native RAG support and multi-provider abstraction can replace the direct Http::post() calls here with a considerably cleaner interface — particularly if you need to swap between OpenAI and Anthropic without rewriting the pipeline.
Caching Embeddings with Redis
Re-embedding identical strings on every request is wasteful and costly. Common queries, navigation labels, UI strings — anything that does not change should be cached.
<?php
namespace App\Services\AI;
use Illuminate\Support\Facades\Cache;
class CachedEmbeddingService extends EmbeddingService
{
private const TTL_SECONDS = 60 * 60 * 24 * 7; // 7 days
public function embed(string $text): array
{
// Hash the input so arbitrary text makes a safe, fixed-length cache key
$key = 'embedding:' . hash('sha256', $text);
return Cache::store('redis')->remember($key, self::TTL_SECONDS, function () use ($text) {
return parent::embed($text);
});
}
}
Bind this as your production implementation in AppServiceProvider:
$this->app->singleton(EmbeddingService::class, function () {
return new CachedEmbeddingService(
apiKey: config('services.openai.key'),
);
});
[Efficiency Gain] On a typical knowledge-base application, 60–80% of embedding API calls can be for repeated or near-identical queries (“what is your refund policy?” appears in dozens of variations). Because the key is a SHA-256 hash, only byte-identical strings hit the cache, so normalise case and whitespace before hashing; genuinely different phrasings will still miss. With that in place, a Redis cache with a 7-day TTL can reduce your embedding API spend significantly. Track cache hit rate as a metric. If it stays below 20%, your queries are too diverse to benefit much from exact-match caching.
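A minimal sketch of a normalised cache key follows. The function name embeddingCacheKey is hypothetical, and folding case is an assumption that suits search queries; skip it if case carries meaning in your inputs.

```php
<?php

declare(strict_types=1);

/**
 * Build a cache key that ignores case and whitespace differences and is
 * scoped to the embedding model, so a model switch never serves stale vectors.
 */
function embeddingCacheKey(string $text, string $model = 'text-embedding-3-small'): string
{
    // Collapse runs of whitespace; fall back to the raw text if the regex fails on invalid UTF-8
    $collapsed = preg_replace('/\s+/u', ' ', $text) ?? $text;
    $normalised = mb_strtolower(trim($collapsed));

    return 'embedding:' . $model . ':' . hash('sha256', $normalised);
}

$a = embeddingCacheKey('What is  your refund policy?');
$b = embeddingCacheKey('  what is your refund policy? ');

var_dump($a === $b);
```

If you normalise for the key, embed the normalised string as well, so every variant that shares a key also shares the vector that was actually computed for it.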
Production Considerations
Metadata Filtering
Do not retrieve across your entire corpus indiscriminately. Always filter by tenant, user, document category, or access level before executing the ANN search:
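// Assumes tenant_id was written into each chunk's metadata at ingestion time
return DB::table('document_chunks')
->select([...])
->where('metadata->tenant_id', auth()->user()->tenant_id) // JSONB operator
->whereRaw("1 - (embedding <=> ?::vector) >= ?", [$vectorLiteral, $threshold])
// Order by raw distance, ascending, so pgvector can use the ANN index
->orderByRaw('embedding <=> ?::vector', [$vectorLiteral])
->limit($limit)
->get();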
Failing to scope by tenant in a multi-tenant application is a data leak waiting to happen.
Index Maintenance
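IVFFlat indexes need to be rebuilt periodically as data grows, because the list centroids are fixed when the index is built. Schedule a command to run REINDEX INDEX CONCURRENTLY document_chunks_embedding_idx during a low-traffic window if your corpus is actively growing; the CONCURRENTLY option (PostgreSQL 12+) avoids blocking writes during the rebuild.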
Cost Tracking
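If you are already tracking token usage through a Laravel middleware layer for AI API costs, embeddings should feed into that same telemetry. The text-embedding-3-small model costs $0.02 per million tokens. At that rate, 10,000 chunks of roughly 400 tokens each is about 4 million tokens, or around $0.08 per day, so routine ingestion is cheap. Costs become noticeable mainly when you re-embed an entire corpus after a chunking or model change, or when uncached query traffic is high, which is why the spend is still worth tracking.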
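The arithmetic is simple enough to keep next to your telemetry. This is a sketch with a hypothetical function name; the default rate is the $0.02 per million tokens quoted above and should be treated as a configurable assumption, since provider pricing changes.

```php
<?php

declare(strict_types=1);

/**
 * Estimate embedding spend in USD for a batch of chunks.
 */
function estimateEmbeddingCostUsd(
    int $chunkCount,
    int $avgTokensPerChunk,
    float $usdPerMillionTokens = 0.02,
): float {
    $totalTokens = $chunkCount * $avgTokensPerChunk;

    return ($totalTokens / 1_000_000) * $usdPerMillionTokens;
}

// 10,000 chunks per day at ~400 tokens each
printf("Daily:   $%.2f\n", estimateEmbeddingCostUsd(10_000, 400));

// Re-embedding a 2,000,000-chunk corpus after a chunking change
printf("Reindex: $%.2f\n", estimateEmbeddingCostUsd(2_000_000, 400));
```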
Common Mistakes in Production Laravel RAG
Chunking at the wrong boundary. Splitting at a fixed character count breaks sentences. Paragraphs are the natural semantic unit for most prose documents. Respect them.
Embedding everything without metadata. A vector alone tells you nothing about origin, age, or access rights. Always store metadata alongside your chunks. You will need it for filtering, debugging, and attribution.
Trusting retrieved content blindly. The LLM will do whatever the retrieved content suggests if your system prompt allows it. Always instruct the model to stay within the provided context and always surface source attribution to the user.
Ignoring index warm-up. IVFFlat indexes perform poorly on cold queries immediately after an instance restart, before the index pages are back in memory. If your application has predictable traffic spikes, run a warm-up query on startup, or preload the index with the pg_prewarm extension.
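Synchronous embedding in request handlers. Covered above. Use queues for document ingestion. Always. The one embedding call that has to stay in the request path is the user’s query at search time, and that adds a few hundred milliseconds to every uncached request, which is exactly what the Redis cache above is for.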
Not evaluating retrieval quality. Most RAG failures are retrieval failures, not model failures. Log the chunks returned for each query in development. Review them manually. If what you retrieve does not contain the answer, no generation model can fix it.
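Manual review scales better once it is backed by a number. A simple metric to start with is hit rate at k: the fraction of test queries whose known-correct chunk appears in the top k results. The sketch below assumes you have logged retrieved chunk IDs per query and labelled the expected chunk by hand; the function name and data shape are hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Fraction of test cases whose expected chunk ID appears in the top $k retrieved IDs.
 *
 * @param array<int, array{expected: int, retrieved: int[]}> $cases
 */
function hitRateAtK(array $cases, int $k): float
{
    if ($cases === []) {
        return 0.0;
    }

    $hits = 0;

    foreach ($cases as $case) {
        // Only the first $k results count; retrieved IDs are ordered best-first
        $topK = array_slice($case['retrieved'], 0, $k);

        if (in_array($case['expected'], $topK, true)) {
            $hits++;
        }
    }

    return $hits / count($cases);
}

$cases = [
    ['expected' => 12, 'retrieved' => [12, 40, 7]], // hit at rank 1
    ['expected' => 33, 'retrieved' => [5, 9, 33]],  // hit at rank 3
    ['expected' => 81, 'retrieved' => [2, 3, 4]],   // miss
];

printf("hit rate @3: %.2f\n", hitRateAtK($cases, 3));
printf("hit rate @1: %.2f\n", hitRateAtK($cases, 1));
```

Re-running the same labelled set after every change to chunk size, overlap, or threshold shows whether the change helped retrieval or merely moved the failures around.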
What Comes Next
The implementation above gives you a functional, production-viable RAG pipeline in Laravel. The next architectural challenge is hardening the LLM output layer — specifically, validating that the generation model’s response actually stays within the retrieved context and conforms to your business schema. That is a problem of structured output and schema enforcement, covered in depth in Hardening Laravel Agentic Workflows: Schema Validation Against LLM Hallucinations.
If you are evaluating the service architecture around this pipeline — contracts, provider abstraction, telemetry — Production-Grade AI Architecture in Laravel is the logical companion piece.
Embeddings are not a feature. They are infrastructure. Build them accordingly.
Further Reading
- OpenAI Embeddings API Reference — model comparison, dimension reduction, and batching limits.
- pgvector on GitHub — index type comparison (IVFFlat vs HNSW), operator documentation, and PostgreSQL version compatibility matrix.
A software architect with 15+ years of experience in the PHP and Laravel ecosystem. Dewald created Origin Main to provide the engineering rigour required to integrate AI into professional, high-concurrency production systems. He writes for developers who care less about "getting it to work" and more about "getting it to last".

