Laravel Horizon in Production: Configuring AI Queue Workloads That Actually Hold

Last reviewed: May 2026

Laravel Horizon in production looks deceptively simple until your first LLM inference job times out silently and your users start receiving empty responses. Standard queue jobs (sending emails, processing images, syncing records) complete in milliseconds. AI inference jobs do not. A cold claude-sonnet-4-6 call with a dense system prompt can run for 45 seconds. A gemini-2.5-pro batch summarisation job can exceed two minutes under load. Horizon’s defaults were not built for this, and the failure modes are nasty: jobs that disappear without landing in failed_jobs, rate limit retries that exhaust the tries budget in under 30 seconds, and expensive inference work discarded mid-completion.

This guide is part of the AI Deployment & Production Operations module, which covers the full surface area of running Laravel AI applications in production. If you are still wiring up the surrounding deployment infrastructure, the complete production deployment guide is the right starting point.

What follows covers the three layers where AI queue workloads require deliberate configuration: supervisor setup, job class design, and operational monitoring.

Why AI Jobs Break Standard Horizon Assumptions

Standard Horizon configuration assumes workers cycle through jobs in seconds. The defaults reflect that: a 60-second timeout, a minimal retry budget with no backoff configuration, and supervisor settings tuned for throughput. Those assumptions collapse the moment you start queuing LLM inference.

Three failure modes come up repeatedly.

Silent timeout kills. Horizon’s default timeout of 60 seconds is aggressive for AI inference. A gpt-4o call with a large context window can sit at 50 seconds before returning its first token. Add network variance and the worker process receives a SIGKILL mid-call. No exception is logged. The job does not land in failed_jobs. It just vanishes. This is the most common support ticket pattern we see from teams that have not tuned Horizon for AI: “jobs are disappearing.”

Rate limit mishandling. Provider 429 responses from OpenAI, Anthropic, and Google are not errors in the traditional sense. They are expected, temporary, and recoverable. Retrying immediately burns through the tries budget in seconds. Without a backoff array defined on the job, Laravel uses zero delay between retries by default. A job hitting a rate limit five times in 15 seconds has failed just as permanently as one that hit a genuine error.

Partial output loss. AI jobs often do useful work before failing. A document summarisation job might process 80% of its input before hitting a context limit. Standard job failure handling discards that state entirely. For expensive inference workloads on long documents, that is a measurable cost.
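One mitigation is to checkpoint progress as the job runs, so a retry resumes from where the previous attempt stopped instead of re-running paid inference. A rough sketch, assuming a chunked summarisation job with a JSON chunk_summaries column for intermediate results; the column, the chunks() relation, and the summariseChunk()/combine() helpers are illustrative:

public function handle(): void
{
    $document  = Document::findOrFail($this->documentId);
    $summaries = $document->chunk_summaries ?? [];   // resume from the last checkpoint

    foreach ($document->chunks()->get() as $index => $chunk) {
        if (isset($summaries[$index])) {
            continue;   // already summarised on a previous attempt
        }

        $summaries[$index] = $this->summariseChunk($chunk);   // one inference call per chunk

        // Persist after every chunk so a timeout or failure loses at most one call.
        $document->update(['chunk_summaries' => $summaries]);
    }

    $document->update(['ai_insight' => $this->combine($summaries)]);
}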

The fix requires changes at three levels: supervisor configuration, job class design, and monitoring.

Configuring Laravel Horizon in Production for AI Workloads

Install Horizon if you have not already:

composer require laravel/horizon
php artisan horizon:install

The critical configuration lives in config/horizon.php. The default supervisor configuration is intentionally generic. For AI workloads, you need a dedicated supervisor pool with materially different settings.

// config/horizon.php

'environments' => [
    'production' => [

        'supervisor-ai-inference' => [
            'connection'          => 'redis',
            'queue'               => ['ai-high', 'ai-default', 'ai-low'],
            'balance'             => 'auto',
            'autoScalingStrategy' => 'time',
            'minProcesses'        => 2,
            'maxProcesses'        => 12,
            'balanceMaxShift'     => 2,
            'balanceCooldown'     => 5,
            'timeout'             => 300,  // 5 minutes — covers streaming completions
            'sleep'               => 3,
            'tries'               => 5,
            'nice'                => 0,
        ],

        'supervisor-default' => [
            'connection'   => 'redis',
            'queue'        => ['default', 'notifications', 'mail'],
            'balance'      => 'simple',
            'minProcesses' => 1,
            'maxProcesses' => 8,
            'timeout'      => 60,
            'sleep'        => 3,
            'tries'        => 3,
        ],
    ],
],

A few of these decisions are worth explaining.

autoScalingStrategy: time scales workers based on queue wait time rather than queue size. For AI workloads, queue size is a poor signal: three jobs waiting sounds manageable, but if each takes 90 seconds on a single worker, the last user waits over four minutes for a result. Time-based scaling catches this earlier.

timeout: 300 gives generous headroom for streaming completions and large context calls. This is not a ceiling you should approach routinely; it is a safety net. If jobs are regularly running past 120 seconds, that is a prompt engineering problem, not a timeout problem.

balanceCooldown: 5 prevents the auto-balancer from thrashing worker counts during a burst of short AI calls followed by a trough. Default is 3 seconds, which is too reactive for inference workloads.
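With the dedicated supervisor pool in place, route AI work onto those queues explicitly at dispatch time so it never shares a worker pool (or a timeout) with mail and notification jobs. A brief sketch — GenerateAIInsightJob is the job class built later in this guide, and the priority logic is illustrative:

use App\Jobs\GenerateAIInsightJob;

// User-facing inference goes to the high-priority queue; everything else to the default.
GenerateAIInsightJob::dispatch($document->id, $prompt)
    ->onQueue($isUserFacing ? 'ai-high' : 'ai-default');

// Bulk or overnight work goes to the low-priority queue.
GenerateAIInsightJob::dispatch($document->id, $prompt)->onQueue('ai-low');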

Supervisord server configuration is equally important. The stopwaitsecs value must exceed Horizon’s timeout value, or the process manager will kill a running Horizon worker before it finishes a long inference job during deployments:

[program:laravel-horizon]
process_name=%(program_name)s
command=php /var/www/html/artisan horizon
autostart=true
autorestart=true
user=www-data
redirect_stderr=true
stdout_logfile=/var/www/html/storage/logs/horizon.log
stopwaitsecs=360

Set stopwaitsecs to at least timeout + 60. We have seen rolling deployments silently truncate in-flight inference calls because this was left at the default 10 seconds.

Designing the Job Class: Timeout, Retry, and Rate Limit Handling

The supervisor configuration sets the outer boundary. The job class defines behaviour within it. For AI inference jobs, three properties are non-negotiable: $timeout, $tries, and $backoff.

<?php

namespace App\Jobs;

use App\Models\Document;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Queue\Middleware\RateLimited;
use Illuminate\Support\Facades\Log;

class GenerateAIInsightJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    /**
     * Hard kill threshold. Horizon's supervisor timeout is the outer wall;
     * this property is the job's own declaration to the queue system.
     * Set it below the supervisor timeout to allow graceful error handling.
     */
    public int $timeout = 240;

    /**
     * Total attempts before the job is moved to failed_jobs.
     * Five attempts with increasing backoff covers transient provider outages.
     */
    public int $tries = 5;

    /**
     * Seconds to wait before each retry: the first retry waits 30s,
     * the second 60s, and so on through the array.
     */
    public array $backoff = [30, 60, 120, 180, 240];

    public function __construct(
        private readonly int    $documentId,
        private readonly string $prompt,
        private readonly string $model = 'claude-sonnet-4-6',
    ) {}

    public function middleware(): array
    {
        return [new RateLimited('ai-inference')];
    }

    public function handle(): void
    {
        $document = Document::findOrFail($this->documentId);

        try {
            $response = \Anthropic::messages()->create([
                'model'      => $this->model,
                'max_tokens' => 2048,
                'messages'   => [
                    ['role' => 'user', 'content' => $this->prompt],
                ],
            ]);

            $document->update([
                'ai_insight'       => $response->content[0]->text,
                'insight_model'    => $this->model,
                'insight_token_count' => $response->usage->inputTokens + $response->usage->outputTokens,
            ]);

        } catch (\Throwable $e) {
            if ($this->isRateLimitException($e)) {
                // Release back to the queue with the appropriate backoff delay
                // rather than throwing. A released job still increments
                // attempts(), so retryUntil() below is the real safety net.
                $this->release($this->backoff[$this->attempts() - 1] ?? 240);
                return;
            }

            Log::error('AI insight generation failed', [
                'document_id' => $this->documentId,
                'attempt'     => $this->attempts(),
                'error'       => $e->getMessage(),
            ]);

            throw $e;
        }
    }

    public function failed(\Throwable $exception): void
    {
        // Preserve whatever partial work exists rather than nulling it.
        Document::where('id', $this->documentId)->update([
            'ai_insight_status' => 'failed',
            'ai_insight_error'  => $exception->getMessage(),
        ]);

        Log::critical('AI insight job exhausted all retries', [
            'document_id' => $this->documentId,
            'model'       => $this->model,
        ]);
    }

    public function retryUntil(): \DateTime
    {
        // Absolute deadline. Even with $tries remaining, the job will not
        // retry after this point. Critical for time-sensitive inference pipelines.
        return now()->addHours(3);
    }

    private function isRateLimitException(\Throwable $e): bool
    {
        return str_contains($e->getMessage(), '429')
            || str_contains($e->getMessage(), 'rate_limit')
            || str_contains($e->getMessage(), 'Too Many Requests');
    }
}

The $this->release() call in the catch block is the right response to provider rate limits: it puts the job back on the queue with a delay instead of recording an exception. Note that a released job still counts towards attempts(), so pair the pattern with retryUntil() (as this job does) or a generous tries budget. Rate limits are not job failures; they are scheduling signals.

retryUntil() is the safety valve. Without it, a job bouncing between backoff delays and rate-limit releases could keep retrying long after the result is no longer needed. Set this to match the actual business requirement.
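If rate-limit releases are still eating into the attempt budget, one option (a sketch, not part of the job above) is to let retryUntil() govern the overall deadline and cap genuine failures separately with Laravel's $maxExceptions property:

class GenerateAIInsightJob implements ShouldQueue
{
    // ...

    // Allow a generous number of total attempts (rate-limit releases included),
    // but move the job to failed_jobs after three attempts that actually end
    // in an unhandled exception.
    public int $tries = 25;
    public int $maxExceptions = 3;
}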

Registering the Rate Limiter

The RateLimited middleware on the job references a named rate limiter. Register it in your AppServiceProvider:

// app/Providers/AppServiceProvider.php

use Illuminate\Cache\RateLimiting\Limit;
use Illuminate\Support\Facades\RateLimiter;

public function boot(): void
{
    RateLimiter::for('ai-inference', function (object $job) {
        // Adjust this to match your provider tier.
        // Anthropic Tier 2: ~1,000 RPM. OpenAI Tier 3: ~5,000 RPM.
        // Start conservative and increase as you verify throughput.
        return Limit::perMinute(60)->by('global');
    });
}

For multi-tenant applications with per-tenant provider keys, scope the limiter by tenant:

RateLimiter::for('ai-inference', function (object $job) {
    $tenantId = $job->tenantId ?? 'global';
    return Limit::perMinute(30)->by("tenant:{$tenantId}");
});
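For that closure to see a tenant id, the job has to expose it as a public property. A minimal sketch against the job class above; the tenantId parameter is an assumption about your tenancy model:

public function __construct(
    public readonly int     $tenantId,    // public so the limiter closure can read $job->tenantId
    private readonly int    $documentId,
    private readonly string $prompt,
    private readonly string $model = 'claude-sonnet-4-6',
) {}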

If you are building the broader governance layer around token budgets and per-tenant limits, the AI middleware article on token tracking covers the HTTP request layer equivalent of this pattern.

Job Duration: Why This Visualisation Matters

The chart below illustrates why AI inference jobs cannot share a supervisor pool with standard queue jobs. Standard jobs cluster almost entirely below 200ms. AI inference jobs distribute across a 5-to-90-second range, with a meaningful tail. A shared 60-second timeout kills the tail of the AI distribution silently.

[Chart: job duration distribution, standard queue jobs vs AI inference jobs, with the default 60-second timeout marked.]

Duration bucket | Standard jobs | AI inference jobs
0–200ms         | 88%           | 1%
200ms–1s        | 10%           | 3%
1–5s            | 2%            | 8%
5–15s           | 0%            | 38%
15–60s          | 0%            | 42%
60s+            | 0%            | 8%

Monitoring What Actually Matters for AI Queues

The Horizon dashboard gives you throughput, wait time, and recent job runtime out of the box. For standard workloads, those three numbers tell you most of what you need. For AI inference workloads, the signal you need most is not surfaced by default: the ratio of jobs that exit via SIGKILL versus jobs that complete normally.

The table below outlines which Horizon metrics carry real weight for AI workloads, and the thresholds worth building alerts around.

Metric                 | Default threshold | AI workload threshold | Why it differs
Job wait time          | Alert at 30s      | Alert at 120s         | AI jobs are slower; short wait time spikes are normal
Job runtime (p95)      | Alert at 10s      | Alert at 90s          | Long completions are expected; watch the tail, not the mean
Failed job rate        | Alert at 5%       | Alert at 2%           | Inference is expensive; failures cost more than compute
Retry rate             | Not monitored     | Alert at 15%          | High retries indicate a rate limit or model instability problem
Queue depth (ai-high)  | Alert at 50       | Alert at 10           | High-priority AI jobs should process near-immediately
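Horizon does ship one hook that maps onto the wait time row: the waits option in config/horizon.php fires a LongWaitDetected event when a queue's estimated wait crosses a threshold. A sketch of wiring it up for the AI queues, with thresholds matching the table above:

// config/horizon.php — per-queue wait thresholds, in seconds
'waits' => [
    'redis:ai-high'    => 10,
    'redis:ai-default' => 120,
],

// app/Providers/AppServiceProvider.php (boot) — act on the event
use Illuminate\Support\Facades\Event;
use Illuminate\Support\Facades\Log;
use Laravel\Horizon\Events\LongWaitDetected;

Event::listen(LongWaitDetected::class, function (LongWaitDetected $event) {
    Log::critical('AI queue wait threshold breached', [
        'connection' => $event->connection,
        'queue'      => $event->queue,
        'seconds'    => $event->seconds,
    ]);
});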

For the silent timeout kill problem, there is no native Horizon alert. The symptom is a job that leaves no trace in failed_jobs despite not completing. You can detect this indirectly by tracking job dispatch counts against completion counts in your application layer:

// Dispatch side (controller or service) — record intent
Cache::increment('jobs:dispatched:total');
GenerateAIInsightJob::dispatch($document->id, $prompt)->onQueue('ai-default');

// Last line of the job's handle() method — record completion
Cache::increment('jobs:completed:total');

A growing gap between those two counters, without a corresponding growth in failed_jobs, is the fingerprint of silent SIGKILLs. Add a scheduled command to routes/console.php that alerts when the gap exceeds a threshold:

// routes/console.php

use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;
use Illuminate\Support\Facades\Schedule;

Schedule::call(function () {
    $dispatched  = Cache::get('jobs:dispatched:total', 0);
    $completed   = Cache::get('jobs:completed:total', 0);
    $failed      = DB::table('failed_jobs')
                     ->where('queue', 'like', 'ai-%')
                     ->where('failed_at', '>=', now()->subHour())
                     ->count();

    $missing = $dispatched - $completed - $failed;

    if ($missing > 5) {
        Log::critical('AI jobs disappearing without trace', [
            'dispatched' => $dispatched,
            'completed'  => $completed,
            'failed'     => $failed,
            'missing'    => $missing,
        ]);
    }
})->everyFiveMinutes();

[Production Pitfall] The silent SIGKILL is by far the most dangerous failure mode in AI queue workloads, because it produces no actionable output. Teams routinely run in this state for weeks without realising it, attributing the missing outputs to “the AI being slow.” Check your stopwaitsecs and timeout alignment before anything else. If stopwaitsecs in your supervisord config is lower than Horizon’s timeout, every deployment is silently truncating in-flight inference calls.

If you are building the broader observability layer around your AI architecture, the governance and telemetry patterns in the production AI architecture guide cover how to centralise this kind of cross-cutting operational signal across providers.

Failed Job Strategy for LLM Inference

When a job does land in failed_jobs, the default response is to re-dispatch it manually via php artisan queue:retry and hope the error was transient. For AI inference, that is rarely sufficient. Inference failures tend to cluster around specific causes (provider outages, malformed prompts, context window overflows, or inference parameters that produce invalid output), and each warrants a different response.

Structure your failed job handling to capture enough context to triage correctly:

public function failed(\Throwable $exception): void
{
    $reason = match (true) {
        str_contains($exception->getMessage(), 'context_length_exceeded') => 'context_overflow',
        str_contains($exception->getMessage(), '429')                     => 'rate_limit_exhausted',
        str_contains($exception->getMessage(), 'invalid_request_error')   => 'bad_prompt',
        default                                                           => 'unknown',
    };

    Document::where('id', $this->documentId)->update([
        'ai_insight_status' => 'failed',
        'ai_failure_reason' => $reason,
        'ai_insight_error'  => $exception->getMessage(),
    ]);

    Log::error('AI inference job failed permanently', [
        'document_id' => $this->documentId,
        'model'       => $this->model,
        'reason'      => $reason,
        'attempts'    => $this->attempts(),
    ]);
}

The ai_failure_reason column is the important addition. context_overflow failures need prompt truncation logic, not a retry. bad_prompt failures need a developer looking at the prompt template, not an automated re-queue. Retrying them blindly burns provider quota for no benefit.
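A small sketch of acting on that column: a scheduled pass that re-queues only the failure categories where a retry can plausibly succeed. The status values match the failed() handler above; the retrying status and the prompt column on Document are assumptions:

// routes/console.php — retry only the categories where a retry can help
use App\Jobs\GenerateAIInsightJob;
use App\Models\Document;
use Illuminate\Support\Facades\Schedule;

Schedule::call(function () {
    Document::where('ai_insight_status', 'failed')
        ->whereIn('ai_failure_reason', ['rate_limit_exhausted', 'unknown'])
        ->each(function (Document $document) {
            $document->update(['ai_insight_status' => 'retrying']);

            GenerateAIInsightJob::dispatch($document->id, $document->prompt)
                ->onQueue('ai-low');
        });
})->hourly();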

For agentic pipelines where the inference job is one step in a multi-step chain, the failure handling becomes more complex. The question of what to do with downstream jobs that depend on the failed step, and how to validate the partial output that does exist, is covered in depth in the agentic workflow schema validation guide. The core principle applies here too: validate what you have before deciding whether a retry is warranted.

One final note on using the Horizon dashboard and Laravel Telescope together: Telescope’s job watcher captures the full job payload, exception stack trace, and timing for every failed job. For AI workloads, the job payload includes the prompt, which makes post-mortem analysis significantly faster. Enable the job watcher in non-production environments at minimum, and consider enabling it in production with payload scrubbing for anything containing PII. See the Laravel Horizon documentation for tag-based filtering, which lets you isolate AI job failures without noise from other queue traffic. Anthropic’s rate limit documentation is the authoritative reference for per-tier RPM and token-per-minute limits if you are calibrating your RateLimiter::for('ai-inference') ceiling.
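Tag-based filtering depends on the job exposing tags. Horizon auto-tags jobs based on any Eloquent models they carry, but explicit tags make AI traffic much easier to isolate — a short addition to the job class above:

/**
 * Tags shown in the Horizon dashboard. Filtering on 'ai-inference'
 * isolates AI job activity (and failures) from the rest of the queue traffic.
 */
public function tags(): array
{
    return ['ai-inference', "model:{$this->model}", "document:{$this->documentId}"];
}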


Frequently Asked Questions

Why does my AI job disappear without appearing in failed_jobs?

The most common cause is a SIGKILL from the process manager before the job completes, which happens when Horizon’s timeout value exceeds the stopwaitsecs value in your supervisord config. The worker process is killed mid-execution before Laravel can record the failure. Increase stopwaitsecs to at least timeout + 60 and redeploy.

Should I use a separate Redis connection for AI queues?

If your AI queue volume is high enough to compete with your default Redis connection, yes. Separate connections give you independent memory limits, separate monitoring, and the ability to flush one queue without affecting the other. Define a second Redis connection in config/database.php, a queue connection in config/queue.php that uses it, and point the supervisor at that queue connection via the connection key.
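A sketch of what that separation looks like, assuming a second Redis database on the same server; the connection names and database number are illustrative:

// config/database.php — dedicated Redis connection for AI queues
'redis' => [
    // ...
    'ai-queues' => [
        'host'     => env('REDIS_HOST', '127.0.0.1'),
        'password' => env('REDIS_PASSWORD'),
        'port'     => env('REDIS_PORT', '6379'),
        'database' => env('REDIS_AI_QUEUE_DB', '2'),
    ],
],

// config/queue.php — queue connection that uses it
'ai-redis' => [
    'driver'      => 'redis',
    'connection'  => 'ai-queues',
    'queue'       => 'ai-default',
    'retry_after' => 330,   // must exceed the longest job timeout
    'block_for'   => null,
],

// config/horizon.php — point the AI supervisor at the new queue connection
'supervisor-ai-inference' => [
    'connection' => 'ai-redis',
    // ...
],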

What is the right tries value for LLM inference jobs?

Five is a reasonable default for transient provider errors. More importantly, pair it with retryUntil() to set an absolute deadline. Five retries with a [30, 60, 120, 180, 240] backoff array can span over 10 minutes of wall-clock time — fine for a background summarisation task, not acceptable for a user-facing generation pipeline. retryUntil() is the correct control for time-sensitive inference jobs.

Can I share one Horizon supervisor across AI and standard jobs?

You can, but the timeout setting applies per-supervisor, so you end up with either a 60-second ceiling that kills AI jobs or a 300-second ceiling that lets runaway standard jobs tie up workers for five minutes. Separate supervisors with different timeout values is the correct approach.

How do I handle token streaming inside a queued job?

Token streaming over SSE is a transport layer concern, not a queue concern. Inside a queued job, you are collecting the full completion synchronously and persisting it. If you need to surface partial token streaming to a user in real time, that is a separate HTTP request with SSE — the queue job handles the asynchronous persistence layer, and SSE handles the real-time delivery layer. Mixing them in a single job is an architecture problem, not a configuration one.
