
Laravel Octane vs PHP-FPM for AI Workloads: What the Benchmarks Actually Show

The choice between Laravel Octane and PHP-FPM looks different the moment LLM API calls enter the picture. For a standard CRUD application, PHP-FPM’s shared-nothing model is reliable, well-understood, and good enough for almost every traffic pattern you will realistically encounter. For AI workloads, the calculus shifts. An LLM API call is not a fast database query. A streaming Claude or Gemini response holds an open HTTP connection for anywhere between 3 and 30 seconds depending on output length and model. PHP-FPM ties up a worker for every second of that wait. That single behaviour, not bootstrap overhead, not memory pressure, is the primary problem at scale.

This article covers what the benchmarks actually show, where the gains come from, and what you need to fix before you migrate. All examples use Swoole as the Octane driver; the section on how Octane changes the request model explains why. The Laravel Octane documentation and Open Swoole documentation are the authoritative references for installation and driver-specific configuration.

This article is part of the AI Deployment & Production Operations module, which covers infrastructure provisioning, queue configuration, and production operations for Laravel AI pipelines.

The PHP-FPM Problem Is Specifically an I/O Problem

PHP-FPM’s shared-nothing model is a feature, not a flaw. A crashed worker does not affect others. Memory leaks reset at request boundaries. You get a clean slate on every request without writing any cleanup code. For workloads where most processing time is CPU or database I/O, this is a reasonable bargain.

AI workloads break that bargain. When your Laravel application dispatches a request to an LLM provider, the PHP-FPM worker sits idle while it waits on the network. A non-streaming call to GPT-4o returning a 500-token response takes 1–4 seconds. A streaming Gemini generation over a long document runs 20–30 seconds. During every one of those seconds, the worker is occupied and unavailable to serve anything else.

Under concurrent load this compounds quickly. Ten FPM workers, each holding an open LLM connection, means ten is your effective concurrency ceiling for outbound AI calls. New requests queue. Latency climbs. The typical response is horizontal scaling, which works, but it is paying for infrastructure to compensate for an architectural mismatch rather than addressing the mismatch.

[Production Pitfall] Teams running LLM pipelines through PHP-FPM at volume consistently hit worker exhaustion before CPU or memory pressure. The symptom is a sharp P99 latency spike under moderate concurrency, not under high load. Twenty concurrent users triggering 5-second LLM responses is enough to saturate a ten-worker FPM pool. If this pattern looks familiar, that is what you are hitting.
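The saturation arithmetic behind this pitfall can be sketched in a few lines of plain PHP. This is a back-of-envelope model, assuming every request holds a worker for the same duration and all requests arrive at once; the function names are illustrative and not part of Laravel or PHP-FPM.

```php
<?php
declare(strict_types=1);

/**
 * Maximum sustained requests per second for a pool of blocking workers,
 * where each request holds one worker for $holdSeconds.
 */
function maxThroughput(int $workers, float $holdSeconds): float
{
    return $workers / $holdSeconds;
}

/**
 * Time until the last of $concurrent simultaneous requests completes,
 * assuming they are served in waves of $workers requests each.
 */
function worstCaseLatency(int $workers, int $concurrent, float $holdSeconds): float
{
    $waves = (int) ceil($concurrent / $workers);

    return $waves * $holdSeconds;
}

echo maxThroughput(10, 5.0), PHP_EOL;        // prints 2
echo worstCaseLatency(10, 20, 5.0), PHP_EOL; // prints 10
```

With ten workers and 5-second calls the pool tops out at two requests per second, and the second wave of ten users waits a full extra call duration before their own call even starts.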

The framework bootstrap overhead (10–30ms per request depending on registered service providers) matters too, but it is a secondary concern. You could eliminate it entirely and still be bottlenecked by the LLM wait. Fix the I/O model first.

How Octane Changes the Request Model

Laravel Octane boots the framework once, keeps it in memory, and processes subsequent requests through persistent workers. The bootstrap overhead disappears from the per-request cost. Service providers register once. Routes load once. Configuration parses once. For standard Laravel applications this alone delivers 2.5–3x better throughput.

For AI workloads, Swoole specifically adds something PHP-FPM cannot: coroutine-based concurrency. When a Swoole worker dispatches an outbound HTTP call to an LLM provider, it can yield execution and switch to another coroutine while the network wait resolves. A single Swoole worker can hold multiple concurrent LLM connections simultaneously. It is not multi-threading. It is cooperative concurrency within a single OS process, which is exactly the right model for I/O-bound workloads.

Three drivers are available. FrankenPHP is the simplest to deploy (a single Go binary, no extension required) and delivers around 2.5x the throughput of PHP-FPM. RoadRunner, a Go-based process manager, sits closer to FPM’s operational model, delivers around 2.8x the throughput, and maintains compatibility with Xdebug, Datadog, and New Relic. Swoole is a PHP extension written in C and C++ that delivers around 3.1x the throughput, with the coroutine model that makes I/O concurrency possible.

The rest of this article focuses on Swoole, because that is where the AI workload story is most compelling. RoadRunner is the right choice if your team depends on APM tooling: both Datadog and New Relic have documented compatibility issues with Swoole’s coroutine model and CLI execution environment. You lose the coroutine concurrency, but you gain full observability.

What the Benchmarks Actually Show

The figures below come from a Deploynix benchmark published April 2026, run on a Hetzner CX32 server (4 vCPU, 8GB RAM, PHP 8.4) against a representative Laravel application with Eloquent queries, cache reads, and Blade rendering. Not a hello-world route. The benchmark tests PHP-FPM at ten workers against four Octane workers across all three drivers. Multiple independent benchmarks corroborate the ratios: a 2.5–3x RPS improvement and 60–70% median latency reduction appear consistently. The absolute figures will vary with hardware and application complexity.

Requests per second (higher is better)

Driver                   RPS
PHP-FPM (10 workers)     850
FrankenPHP (4 workers)   2,100
RoadRunner (4 workers)   2,350
Swoole (4 workers)       2,600

Source: Deploynix benchmark, April 2026 — Hetzner CX32 (4 vCPU, 8GB RAM, PHP 8.4). Representative Laravel application with Eloquent, cache reads, and Blade rendering.


Latency in milliseconds (lower is better)

Driver       P50 (ms)   P99 (ms)   TTFB, cached (ms)
PHP-FPM      45         250        25
FrankenPHP   18         95         5
RoadRunner   16         88         4
Swoole       14         78         3

Source: Deploynix benchmark, April 2026 — Hetzner CX32 (4 vCPU, 8GB RAM, PHP 8.4). TTFB measured on cached responses. P99 values reflect tail latency across the full test window.


The P99 column tells the more important story for AI applications. A P99 of 250ms under PHP-FPM against 78ms under Swoole means tail latency is roughly 3.2x worse. The gap is not confined to the tail: the median (45ms against 14ms) shows the same ratio, so even your fastest requests pay it. Under PHP-FPM, framework bootstrap is a fixed cost paid even on requests that do no real work. Under Octane, that cost is amortised across the worker’s entire lifetime.

The TTFB improvement (25ms to 3ms for cached responses) has a direct impact on AI pipelines that use prompt result caching. If your application caches LLM outputs by prompt hash, the overhead of returning a cache hit under PHP-FPM is roughly eight times higher than under Swoole. At volume, this shows up as a surprisingly high latency floor on what should be trivially fast responses.
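Caching by prompt hash needs a deterministic key. The following is a minimal sketch in plain PHP, assuming the model name should be part of the key; the `llm:` prefix, the function name, and the `$client` object in the closing comment are illustrative assumptions rather than anything the benchmark or Laravel prescribes.

```php
<?php
declare(strict_types=1);

/**
 * Build a deterministic cache key for an LLM call.
 * The model name is part of the key so that switching models
 * never serves a completion produced by a different one.
 */
function promptCacheKey(string $model, string $prompt): string
{
    return 'llm:' . $model . ':' . hash('sha256', $prompt);
}

$key = promptCacheKey('example-model', 'Classify this ticket: printer on fire');
echo $key, PHP_EOL;

// Inside a Laravel application the key would typically be used as:
// Cache::remember($key, now()->addDay(), fn () => $client->complete($prompt));
```

Hashing keeps the key at a fixed length however long the prompt is, which matters for cache backends that limit key size.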

The Request Lifecycle Under AI I/O Load

The performance difference becomes concrete when you trace both lifecycles for an actual LLM API call. The divergence at bootstrap is real but manageable. The divergence at the I/O wait step is structural: PHP-FPM blocks the worker for the full duration, Swoole yields the coroutine and makes the worker available to handle other work.

[Figure] PHP-FPM vs Octane/Swoole request lifecycle for AI workloads. Both paths start with Nginx receiving the request. PHP-FPM then bootstraps Laravel (10–30ms per request), while Octane clones the app container (sub-millisecond). Both run routes and middleware and dispatch the LLM request. At the I/O wait step the PHP-FPM worker blocks and is tied up for the entire wait, while the Swoole coroutine yields and the worker handles other requests. Both then process the response.

At thirty concurrent users, each waiting five seconds for an LLM response, the structural difference at that step is the gap between a responsive API and a queued backlog. Swoole workers doing I/O work are not idle. They are serving other coroutines. FPM workers doing I/O work are doing nothing and blocking queue positions for everyone behind them.

Memory Behaviour: Paying the Right Price

Octane workers hold more RAM per process than FPM workers because the bootstrapped application stays resident. The per-worker numbers look concerning until you compare total deployment memory against throughput delivered.

Memory vs throughput trade-off

Driver       Workers   Per-worker memory (MB)   Total memory (MB)   RPS
PHP-FPM      10        40                       400                 850
FrankenPHP   4         75                       300                 2,100
RoadRunner   4         70                       280                 2,350
Swoole       4         82                       328                 2,600

Per-worker values are approximate midpoints of benchmark-reported ranges. Total = per-worker × worker count (10 for FPM, 4 for Octane drivers). Source: Deploynix benchmark, April 2026.


FrankenPHP and RoadRunner both use less total memory than PHP-FPM while delivering 2.5–2.8x the throughput. Swoole’s four workers total around 328MB, roughly 18% less than FPM’s ten-worker deployment at 400MB, while delivering 3x the output. You are not paying more to run Octane. You are paying less for dramatically better results.
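The trade-off is easier to compare as throughput per megabyte of deployed memory. The sketch below recomputes that ratio in plain PHP from the benchmark figures in the memory table above; the helper name is an illustrative assumption, and the per-worker values are the approximate midpoints the source reports.

```php
<?php
declare(strict_types=1);

// Figures from the memory table: [workers, MB per worker, RPS].
$drivers = [
    'PHP-FPM'    => [10, 40, 850],
    'FrankenPHP' => [4, 75, 2100],
    'RoadRunner' => [4, 70, 2350],
    'Swoole'     => [4, 82, 2600],
];

/** Requests per second delivered per megabyte of total worker memory. */
function rpsPerMb(int $workers, int $mbPerWorker, int $rps): float
{
    return round($rps / ($workers * $mbPerWorker), 2);
}

foreach ($drivers as $name => [$workers, $mbPerWorker, $rps]) {
    printf(
        "%-10s %4d MB total  %5.2f RPS per MB\n",
        $name,
        $workers * $mbPerWorker,
        rpsPerMb($workers, $mbPerWorker, $rps)
    );
}
```

On these figures every Octane driver delivers more than three times FPM’s throughput per megabyte, with RoadRunner marginally ahead of Swoole on this particular ratio.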

The memory risk specific to AI workloads is static state accumulation. LLM response objects can be large. If any part of your application stores response data in static properties or singleton services that are not properly scoped, that memory does not get released between Octane requests. Under PHP-FPM it would, because the process resets. Under Octane it accumulates until the worker recycles.

Always set --max-requests when running Octane on AI workloads. Periodic worker recycling is not optional maintenance here. It is mandatory hygiene when response payload sizes are unpredictable.

php artisan octane:start --server=swoole --workers=4 --max-requests=500

[Architect's Note] The right --max-requests value depends on your average LLM response payload size. For workloads regularly processing multi-kilobyte completions, 500 is a conservative starting point. Monitor worker memory growth with Laravel Telescope or your APM tooling and adjust downward if you see steady accumulation between recycles.
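If you do not yet have APM coverage, a rough way to watch for accumulation is to log each worker’s memory after every request. This is a minimal sketch, assuming a hypothetical WorkerMemoryServiceProvider and log message name; it relies on Octane’s RequestTerminated event and PHP’s built-in memory_get_usage() and getmypid().

```php
<?php

namespace App\Providers;

use Illuminate\Support\Facades\Event;
use Illuminate\Support\Facades\Log;
use Illuminate\Support\ServiceProvider;
use Laravel\Octane\Events\RequestTerminated;

// Hypothetical provider name; register it like any other service provider.
class WorkerMemoryServiceProvider extends ServiceProvider
{
    public function boot(): void
    {
        // Runs after each Octane request has finished on this worker.
        Event::listen(RequestTerminated::class, function (): void {
            Log::info('octane.worker.memory', [
                'pid' => getmypid(), // identifies the worker process
                'memory_mb' => round(memory_get_usage(true) / 1048576, 1),
            ]);
        });
    }
}
```

Grouping the log entries by pid shows whether memory climbs steadily between recycles. Logging every request is noisy, so in production you would likely sample or log only above a threshold.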

Worker Persistence and Concurrent LLM Calls

Octane::concurrently() dispatches multiple closures concurrently and collects their results within a single request lifecycle. Under the hood Octane hands the closures to Swoole task workers, so parallel execution is Swoole-specific: on FrankenPHP and RoadRunner you do not get it from this method. If you are using either of those drivers, use Laravel’s HTTP client pool (Http::pool()) for concurrent outbound calls instead.

use Laravel\Octane\Facades\Octane;

[$completion, $embedding] = Octane::concurrently([
    fn () => $this->claudeClient->complete($prompt),
    fn () => $this->embeddingService->vectorise($prompt),
]);

Both calls dispatch simultaneously. The worker does not wait for the first to finish before starting the second. For a request requiring both a completion and an embedding vector (common in RAG-adjacent pipelines), this cuts the combined wait to the duration of the slower call, not their sum. On typical latencies, that is a 40–60% reduction in total request time for that operation.
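Two practical details are worth sketching. Octane::concurrently() accepts a second argument, the maximum wait in milliseconds, and its default is short relative to LLM latencies (3000ms at the time of writing, worth verifying against your installed version). For drivers other than Swoole, Http::pool() covers the same case. In the sketch below, ClaudeClient and EmbeddingService stand in for your own services, $prompt is assumed to be defined, and the URLs, config key, and payload shapes are placeholders rather than a real provider API.

```php
<?php

use Illuminate\Http\Client\Pool;
use Illuminate\Support\Facades\Http;
use Laravel\Octane\Facades\Octane;

// Swoole: pass an explicit wait in milliseconds as the second argument.
// 30000 is an illustrative value; size it to your slowest expected call.
[$completion, $embedding] = Octane::concurrently([
    fn () => app(ClaudeClient::class)->complete($prompt),
    fn () => app(EmbeddingService::class)->vectorise($prompt),
], 30000);

// FrankenPHP / RoadRunner: Http::pool() sends both requests concurrently.
$responses = Http::pool(fn (Pool $pool) => [
    $pool->as('completion')
        ->withToken(config('services.llm.api_key'))
        ->timeout(30)
        ->post('https://llm.example.test/v1/complete', ['prompt' => $prompt]),
    $pool->as('embedding')
        ->withToken(config('services.llm.api_key'))
        ->timeout(30)
        ->post('https://llm.example.test/v1/embed', ['input' => $prompt]),
]);

// Results are keyed by the names given to as(). A request that could not
// connect yields an exception object here instead of a response, so check
// the type before reading it in real code.
$completionText = $responses['completion']->json('text');
$vector = $responses['embedding']->json('embedding');
```

Either way the total wait approaches the slower of the two calls rather than their sum.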

The same pattern works for multi-provider comparison: dispatching identical prompts to two providers concurrently and selecting the faster or better-structured response. No queue overhead, no external orchestration.

[Architect's Note] Octane::concurrently() runs closures on Swoole task workers, which the Octane documentation describes as separate processes from the worker handling the request, sized with the --task-workers option of octane:start. Closures and their return values are serialized across that process boundary. Pass identifiers into the closures and resolve fresh instances inside, rather than capturing large Eloquent models or full response objects. Serializing large captured state is more expensive than it appears.

The Octane cache, backed by Swoole Tables and shared across all workers on the server, is a second persistence benefit. For prompt templates, system instructions, and model configuration that are read on every request but updated rarely, the Octane cache eliminates a Redis round-trip entirely.

// config/cache.php: add the octane store to the 'stores' array
'octane' => [
    'driver' => 'octane',
],

// Application code
use Illuminate\Support\Facades\Cache;

// Store the resolved prompt for 3600 seconds, then read it back
Cache::store('octane')->put('system-prompt:classifier', $systemPrompt, 3600);
$prompt = Cache::store('octane')->get('system-prompt:classifier');

This is particularly useful for applications following the patterns in Production-Grade AI Architecture in Laravel, where prompt versioning is managed through a central service layer. Caching the resolved prompt at the worker level removes one Redis call per inference request without sacrificing central update capability.

Queues Do Not Run on Octane Workers

This is the most common misunderstanding when teams evaluate Octane for AI workloads. Octane manages HTTP workers. It has no relationship to your queue workers.

Laravel Horizon, which supervises your AI job queues, runs under its own process model. Those workers are standard PHP processes managed by Horizon’s supervisor, not Octane workers. The throughput and latency improvements Octane delivers do not extend to queued jobs.

The correct architecture for long-running AI operations uses Octane for the HTTP handoff and the queue for execution. Octane handles the API request quickly, dispatches a job to Horizon, and returns a job ID. The client polls for completion or subscribes to updates via SSE. Octane’s role is the fast HTTP surface, not the LLM execution itself.
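The handoff described above can be sketched as a small controller. This is an illustrative outline under stated assumptions: GenerateCompletion is a hypothetical queued job that writes its result back to the same cache key, the key format and status values are invented for the example, and the default cache store is assumed to be one that queue workers can also reach (Redis, for instance).

```php
<?php

namespace App\Http\Controllers;

use App\Jobs\GenerateCompletion; // hypothetical queued job
use Illuminate\Http\JsonResponse;
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Str;

class CompletionController extends Controller
{
    // POST /completions: accept the prompt, enqueue the work, return at once.
    public function store(Request $request): JsonResponse
    {
        $validated = $request->validate(['prompt' => ['required', 'string']]);
        $jobId = (string) Str::uuid();

        Cache::put("completion:{$jobId}", ['status' => 'pending'], now()->addHour());
        GenerateCompletion::dispatch($jobId, $validated['prompt']);

        return response()->json(['job_id' => $jobId], 202);
    }

    // GET /completions/{jobId}: polled by the client until the job reports completion.
    public function show(string $jobId): JsonResponse
    {
        $state = Cache::get("completion:{$jobId}");

        return $state === null
            ? response()->json(['error' => 'unknown job'], 404)
            : response()->json($state);
    }
}
```

The Octane worker is held only for validation and a cache write, so the LLM wait is paid entirely by the queue worker.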

[Architect's Note] Any LLM call that regularly exceeds 30 seconds belongs in the queue, not in an HTTP worker. The default 30-second execution limit (PHP’s max_execution_time, mirrored by the max_execution_time option in config/octane.php) terminates the request before the LLM finishes. Octane does not remove this ceiling. If you are evaluating Octane specifically to handle long-running AI operations over HTTP, you are solving the wrong problem.

Cold Starts in Containerised Deployments

PHP-FPM has no meaningful cold start problem. A new worker handles its first request including the full framework bootstrap, and the extra 10–30ms is indistinguishable from steady-state at any realistic monitoring granularity.

Octane’s cold start is different. Each worker must bootstrap the full application before it can serve its first request. For a four-worker Swoole deployment on a typical Laravel AI application, that means four parallel bootstraps at startup: 500ms–2s before any request is served.

In Kubernetes environments where pods scale on demand, this matters. A pod scaling up under a traffic spike is unavailable for those first seconds. The mitigations are standard: keep minimum replicas above zero, use a readiness probe that confirms Octane is accepting connections before traffic routes to the pod, and pre-warm pods during off-peak periods if your scaling pattern allows it. The production deployment guide covers readiness probe configuration for Laravel on Kubernetes specifically.

When PHP-FPM Is Still the Right Call

Octane is not universally correct. There are conditions where staying on PHP-FPM is the right engineering decision.

Below roughly 500 requests per minute, the latency difference is imperceptible to users. Most perceived latency at that scale is network round-trip and frontend rendering. The engineering cost of an Octane migration, including the static state audit below, is not justified by a performance gain the user cannot detect.

If your team depends on Datadog APM or New Relic for production visibility, Swoole creates genuine compatibility friction. Both products have documented issues with Swoole’s coroutine model and CLI execution environment. RoadRunner is a better choice here and you retain the throughput gains, though you give up Swoole’s concurrency features and rely on Http::pool() for concurrent LLM calls.

Package compatibility is worth an honest audit before committing. The PHP ecosystem has largely adapted to Octane, but older or less-maintained packages may still store request-scoped state in static properties. Finding this in production rather than during the audit is a costly discovery.

[Word to the Wise] Teams that regret migrating to Octane almost always skipped the static state audit. It is a two-hour exercise on a typical Laravel codebase. Do it before you deploy, not after you start seeing data leaks or creeping memory growth in production.

Making the Switch: What Actually Breaks

The migration checklist for AI applications differs from the standard Octane checklist in one important way: LLM clients and response objects are larger and more likely to carry problematic static state.

Static properties on AI service classes. Any class storing LLM response data, token counts, or conversation history in a static property will leak state across requests. Audit every class in app/Services/ that touches your AI layer.

class AiResponseCache
{
    private static array $responses = [];

    public static function store(string $key, string $response): void
    {
        self::$responses[$key] = $response;
    }
}

This pattern works under PHP-FPM because the process resets. Under Octane, $responses persists across requests on the same worker. One user’s cached completion leaks into another user’s request context. Replace static storage with a scoped() service binding that resets at request boundaries.
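A minimal sketch of the replacement, assuming a hypothetical AiResponseStore class: state lives on the instance, so each request that receives a fresh instance starts empty. The example is plain PHP and uses two instances to stand in for two requests on the same worker.

```php
<?php
declare(strict_types=1);

final class AiResponseStore
{
    /** @var array<string, string> */
    private array $responses = [];

    public function store(string $key, string $response): void
    {
        $this->responses[$key] = $response;
    }

    public function get(string $key): ?string
    {
        return $this->responses[$key] ?? null;
    }
}

// Two instances stand in for two requests on the same worker.
$requestOne = new AiResponseStore();
$requestOne->store('summary', 'completion for user A');

$requestTwo = new AiResponseStore();
var_dump($requestTwo->get('summary')); // NULL: nothing leaked from request one

// In a service provider's register() method the class would be bound as:
// $this->app->scoped(AiResponseStore::class);
```

The scoped() binding is what gives each request its own instance; the class itself only has to avoid static properties.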

LLM clients bound as singletons. If your LLM client is bound as a singleton() in a service provider and the underlying HTTP client holds connection state, that state persists across requests. Bind AI clients as scoped() instead.

// In App\Providers\AiServiceProvider::register()
// Register AiServiceProvider in bootstrap/providers.php (Laravel 11+) or config/app.php (earlier versions)
$this->app->scoped(ClaudeClient::class, function ($app) {
    return new ClaudeClient(
        apiKey: config('services.claude.api_key'),
        model: config('services.claude.model'),
    );
});

The Laravel AI service layer pattern uses scoped() bindings by default for this exact reason. If your application already follows that architecture, verify rather than assume.

Conversation context services. Services tracking multi-turn conversation history need explicit cleanup in the RequestHandled event, registered in a service provider’s boot() method.

use Illuminate\Support\Facades\Event;
use Laravel\Octane\Events\RequestHandled;

// In App\Providers\AiServiceProvider::register()
// Bind ConversationContext as scoped so it is resolvable and resets per request
$this->app->scoped(ConversationContext::class, fn () => new ConversationContext());

// In App\Providers\AiServiceProvider::boot()
// Flush any remaining conversation state once the request has been handled
Event::listen(RequestHandled::class, function () {
    app(ConversationContext::class)->flush();
});

For token tracking and rate limiting middleware, review your AI middleware implementations and confirm they rely on request-scoped state rather than static counters. A static token counter accumulates across requests on the same Octane worker indefinitely.

Before deploying, run Octane locally in development mode (--watch restarts workers on file changes and must never be used in production) and load-test every AI endpoint for several minutes. Memory leaks that take an hour to surface in production appear within minutes under local concurrency.

One deployment step that becomes mandatory under Octane: always run php artisan config:cache and php artisan route:cache before starting Octane in production. Octane serves from in-memory state and will not pick up file changes at runtime. A config or code change deployed mid-traffic is not visible to running workers until they are reloaded (php artisan octane:reload) or restarted.

[Production Pitfall] Teams that migrate to Octane without auditing their AI service layer almost always encounter creeping memory growth within the first week of production traffic. The symptom is P99 latency climbing gradually over hours, stabilising after worker recycling. Static LLM response accumulation is the first place to look.

Conclusion

The benchmark data makes a clear case. Octane with Swoole delivers 3x the throughput of PHP-FPM at comparable or lower total memory cost, with the coroutine model resolving the specific problem PHP-FPM cannot: making a worker productive during LLM I/O wait rather than idle.

Whether to migrate comes down to two questions. First, does your traffic volume make the gains meaningful? Below 500 requests per minute, probably not. Above that, the infrastructure savings are real. Second, is your AI service layer written with request isolation in mind? If you have followed the production AI integration architecture patterns and your LLM clients are scoped rather than singleton, migration is a day’s work. If static state is scattered through your service layer, the audit takes longer. Do it regardless of whether you migrate. That state causes subtle bugs under PHP-FPM too.


Frequently Asked Questions

Does Laravel Octane affect Laravel Horizon queue workers?

No. Octane manages HTTP workers only. Horizon runs its own PHP process pool under a separate supervisor. The throughput and concurrency improvements Octane delivers do not extend to queued jobs. Your AI batch jobs, embedding pipelines, and async LLM operations continue to run under Horizon’s standard process model regardless of whether Octane is installed.

Does Octane::concurrently() work with FrankenPHP or RoadRunner?

Not in parallel. Octane::concurrently() relies on Swoole task workers, so parallel execution is Swoole-specific. On FrankenPHP or RoadRunner you do not get it from this method. If you need concurrent outbound LLM calls without Swoole, dispatch them via Laravel’s HTTP client pool (Http::pool()) inside a standard request, which uses Guzzle promises rather than Swoole.

How do I know if my application has static state issues before migrating?

Run php artisan octane:install then start Octane locally with --watch. Make two consecutive requests to every AI endpoint from the same worker (you can force this by setting --workers=1 temporarily). If response data or token counts from request one bleed into request two, you have static state. Search your codebase for static $, static array, and static string properties on any class in app/Services/, app/AI/, or equivalent directories.
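Text search can be complemented with reflection, which reports static properties regardless of how they are declared. The following is a minimal sketch in plain PHP; the helper name and the LeakyTokenCounter class are hypothetical, and in a real audit you would loop over the class names in your own service directories.

```php
<?php
declare(strict_types=1);

/**
 * Names of all static properties declared on (or inherited by) a class.
 *
 * @param class-string $class
 * @return list<string>
 */
function staticPropertyNames(string $class): array
{
    $reflection = new ReflectionClass($class);

    return array_map(
        fn (ReflectionProperty $property): string => $property->getName(),
        $reflection->getProperties(ReflectionProperty::IS_STATIC)
    );
}

// Hypothetical class with one static and one instance property.
final class LeakyTokenCounter
{
    private static int $total = 0;
    private int $perRequest = 0;
}

echo implode(', ', staticPropertyNames(LeakyTokenCounter::class)), PHP_EOL; // prints "total"
```

A static property is not automatically a bug (constant lookup tables are harmless), so treat the output as a review list rather than a list of defects.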

Is the Octane cache shared between workers?

Yes. The Octane cache is backed by Swoole Tables, which are shared memory structures accessible to all workers on the same server. Data written by worker one is immediately visible to workers two, three, and four. This makes it suitable for shared prompt templates and model configuration. It is not suitable for request-scoped data: anything that should be private to a single request must use the request lifecycle, not the Octane cache.

What is a safe --max-requests value for AI workloads?

Start at 500 and monitor worker memory over several hours of production traffic using Telescope or your APM tooling. If you see steady memory growth within a recycle window, reduce the value. If workers are recycling frequently with no visible memory trend, increase it. There is no universal correct value: it depends on the size and frequency of your LLM responses. Applications processing large streaming outputs should be more conservative than those handling short structured completions.

Dewald Hugo

A software architect with 15+ years of experience in the PHP and Laravel ecosystem. Dewald created Origin Main to provide the engineering rigour required to integrate AI into professional, high-concurrency production systems. He writes for developers who care less about "getting it to work" and more about "getting it to last".
