Error Recovery Patterns for Production Agents
Your agent will fail. The only question is whether it recovers or crashes.
Why this matters more than you think
Every agent tutorial shows the happy path. The LLM responds correctly, the tool returns data, the workflow completes. In production, roughly 15-25% of agent runs encounter at least one failure — a rate-limited API, a malformed LLM response, a tool timeout, a context window overflow.
The difference between a prototype and a product is what happens during that 15-25%.
Pattern 1: Structured retry with exponential backoff
The simplest pattern, and the one most builders implement incorrectly.
The wrong way: Retry immediately, same parameters, same prompt. This works for transient network errors and nothing else. If the LLM produced a malformed response, retrying with the same prompt gives you the same malformed response.
The right way: Vary the retry strategy based on the error type:
- Rate limit (429): Exponential backoff with jitter. Respect Retry-After headers.
- Malformed output: Re-prompt with the error appended to the context. "Your previous response was invalid JSON. Here was the error: {error}. Please try again."
- Tool failure: Skip the tool and degrade gracefully. "The search API is unavailable. I'll answer based on my existing knowledge, with the caveat that this information may not be current."
- Context overflow: Summarize earlier context and retry with a shorter prompt.
Max retries: Three. After three failures, escalate or fail loudly. Silent infinite retry loops are the single most common production agent bug.
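The branching above can be sketched in Python. This is a minimal sketch, not a library: `call_llm` is a hypothetical stand-in for your model client, `RateLimitError` is an assumed exception type carrying a Retry-After value, and JSON parsing stands in for whatever output validation you do:

```python
import json
import random
import time

MAX_RETRIES = 3  # after three failures, escalate or fail loudly


class RateLimitError(Exception):
    """Assumed exception type for a 429; carries the Retry-After value if present."""
    def __init__(self, retry_after=None):
        self.retry_after = retry_after


def call_with_retry(call_llm, prompt):
    """Vary the retry strategy by error type, capped at MAX_RETRIES."""
    for attempt in range(MAX_RETRIES):
        try:
            return json.loads(call_llm(prompt))
        except RateLimitError as e:
            # Rate limit: exponential backoff with jitter; respect Retry-After.
            delay = e.retry_after or (2 ** attempt + random.random())
            time.sleep(delay)
        except json.JSONDecodeError as e:
            # Malformed output: re-prompt with the error appended to the context.
            prompt += (
                f"\n\nYour previous response was invalid JSON. "
                f"Here was the error: {e}. Please try again."
            )
    raise RuntimeError(f"Failed after {MAX_RETRIES} retries -- escalating")
```

Note that the bounded loop is the point: when the loop exhausts, the function fails loudly instead of spinning forever.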
Pattern 2: Checkpoint and resume
Long-running agent workflows — research pipelines, multi-step analysis, code generation with testing — can take minutes or hours. A crash at step 7 of 10 shouldn't restart from step 1.
Implementation:
- After each major step, serialize the agent's state (current step, accumulated results, remaining tasks) to durable storage.
- On restart, load the last checkpoint and resume from there.
- LangGraph's state persistence does this natively. If you're not using LangGraph, implement it with a simple state table: workflow_id, step, state_json, created_at.
The trap: Don't checkpoint too frequently. Every checkpoint is a write to storage. In high-throughput systems, this becomes a bottleneck. Checkpoint at logical boundaries (after a tool call completes, after a decision is made), not after every LLM call.
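The state table above can be sketched with SQLite. The table and column names come from the text; everything else (the schema details, the latest-row resume query) is an assumption about how you'd wire it up:

```python
import json
import sqlite3


def init_store(conn):
    # workflow_id, step, state_json, created_at -- the simple state table.
    conn.execute("""CREATE TABLE IF NOT EXISTS checkpoints (
        workflow_id TEXT,
        step        INTEGER,
        state_json  TEXT,
        created_at  TEXT DEFAULT CURRENT_TIMESTAMP)""")


def save_checkpoint(conn, workflow_id, step, state):
    # Call this at logical boundaries (after a tool call, after a decision),
    # not after every LLM call.
    conn.execute(
        "INSERT INTO checkpoints (workflow_id, step, state_json) VALUES (?, ?, ?)",
        (workflow_id, step, json.dumps(state)),
    )
    conn.commit()


def load_checkpoint(conn, workflow_id):
    # On restart: resume from the latest checkpointed step, or start at step 0.
    row = conn.execute(
        "SELECT step, state_json FROM checkpoints "
        "WHERE workflow_id = ? ORDER BY step DESC LIMIT 1",
        (workflow_id,),
    ).fetchone()
    return (row[0], json.loads(row[1])) if row else (0, {})
```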
Pattern 3: Fallback chains
When your primary approach fails, have a backup.
Examples:
- LLM fallback: Primary model (Claude Opus) → fallback model (Claude Sonnet) → cached response. Each step is cheaper and faster, but less capable.
- Tool fallback: Web search API → cached results → LLM knowledge. Degrade from real-time to stale data rather than failing entirely.
- Strategy fallback: ReAct reasoning → direct prompting → template response. Each step is simpler and more predictable.
Key rule: The fallback must be strictly simpler than the primary. If your fallback is equally complex, it will fail for the same reasons.
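A fallback chain can be sketched as an ordered list of strategies, tried in turn. The function and strategy names here are hypothetical; the one real constraint is the ordering, simplest last:

```python
def run_with_fallbacks(task, strategies):
    """Try each (name, strategy) pair in order.

    Each strategy must be strictly simpler than the one before it --
    an equally complex fallback fails for the same reasons as the primary.
    """
    errors = []
    for name, strategy in strategies:
        try:
            return strategy(task)
        except Exception as e:
            errors.append((name, str(e)))
    # Nothing worked: fail loudly, with the full error history for context.
    raise RuntimeError(f"All fallbacks exhausted: {errors}")
```

Usage would look like `run_with_fallbacks(query, [("opus", call_opus), ("sonnet", call_sonnet), ("cache", lookup_cache)])`, where the three callables are your own primary, fallback, and cached-response implementations.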
Pattern 4: Circuit breakers
Borrowed from microservices architecture. When a dependency fails repeatedly, stop calling it.
How it works:
- Track failure rate for each tool and each LLM endpoint.
- When failures exceed a threshold (e.g., 5 failures in 60 seconds), "open" the circuit — all calls to that dependency immediately return an error without actually calling it.
- After a cooldown period, allow one test call. If it succeeds, close the circuit; if it fails, keep the circuit open and restart the cooldown.
Why this matters for agents: Without a circuit breaker, a failing tool can cascade into hundreds of wasted LLM calls. The agent keeps trying to use the tool, the tool keeps failing, and each retry burns tokens and time.
Implementation tip: Most agent frameworks don't include circuit breakers. You'll need to implement this yourself or use a library. The state can be as simple as a counter in memory — you don't need a distributed circuit breaker for a single-agent system.
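A minimal in-memory sketch, using the thresholds from above (5 failures in 60 seconds) plus an assumed 30-second cooldown. This is the single-agent version — just a class holding timestamps, no distributed state:

```python
import time


class CircuitBreaker:
    """Open after `threshold` failures within `window` seconds; half-open after `cooldown`."""

    def __init__(self, threshold=5, window=60.0, cooldown=30.0):
        self.threshold = threshold
        self.window = window
        self.cooldown = cooldown
        self.failures = []     # timestamps of recent failures
        self.opened_at = None  # when the circuit was opened, or None if closed

    def allow(self):
        """Check before calling the dependency; False means fail fast without calling."""
        if self.opened_at is None:
            return True
        # After the cooldown, allow one test call (half-open state).
        return time.monotonic() - self.opened_at >= self.cooldown

    def record_success(self):
        # Dependency recovered: close the circuit and forget past failures.
        self.failures.clear()
        self.opened_at = None

    def record_failure(self):
        now = time.monotonic()
        # Keep only failures inside the sliding window, then add this one.
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now  # open: reject calls without making them
```

The caller's pattern is: `if breaker.allow(): try the tool, then record_success() or record_failure(); else: go straight to the fallback.`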
Pattern 5: Supervisory agents
For multi-agent systems, designate one agent as the supervisor. Its only job is to monitor the other agents and intervene when things go wrong.
What the supervisor does:
- Watches for agents that are stuck (no output for N seconds)
- Kills agents that exceed their token budget
- Reassigns tasks from failed agents to backup agents
- Aggregates error reports and decides whether to retry the entire workflow or escalate to a human
What the supervisor doesn't do:
- The supervisor should never attempt to fix the failing agent's output. That's a recipe for cascading errors. Its job is routing and resource management, not content correction.
LangGraph's supervisor pattern implements this cleanly. The supervisor is a node in the graph that receives state updates from all workers and controls the routing.
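Outside a framework, the supervisor's watchdog duties can be sketched as a periodic check over worker bookkeeping. `WorkerStatus`, `reassign`, and `escalate` are hypothetical names for your own state tracking and routing hooks — note the function only routes and manages resources, never touches content:

```python
import time


class WorkerStatus:
    """Per-agent bookkeeping the supervisor watches (hypothetical structure)."""

    def __init__(self, token_budget):
        self.token_budget = token_budget
        self.tokens_used = 0
        self.last_output = time.monotonic()  # updated whenever the agent emits output


def supervise(workers, stall_timeout, reassign, escalate):
    """One supervisory pass: kill stuck or over-budget agents, then reroute.

    `reassign(name)` should hand the task to a backup agent and return True on
    success; `escalate(name)` involves a human. The supervisor never tries to
    fix a failing agent's output.
    """
    now = time.monotonic()
    for name, w in list(workers.items()):
        stuck = now - w.last_output > stall_timeout       # no output for N seconds
        over_budget = w.tokens_used > w.token_budget      # exceeded token budget
        if stuck or over_budget:
            del workers[name]          # kill the failed agent
            if not reassign(name):     # no backup available:
                escalate(name)         # involve a human
```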
Error observability
None of these patterns work without observability. You need to know:
- Which errors are happening (error type, frequency, affected tools)
- When they started (did a deployment trigger a spike?)
- How they resolved (retry succeeded? fallback used? escalated to a human?)
Langfuse and Arize Phoenix both support agent-level tracing with error annotation. At minimum, log every error with: timestamp, agent_id, tool_name, error_type, error_message, retry_count, resolution.
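That minimum log record can be sketched as one structured JSON line per error. The field names come from the list above; the logger setup and function shape are assumptions:

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("agent.errors")


def log_agent_error(agent_id, tool_name, error, retry_count, resolution):
    """Emit one structured record per error -- the minimum needed to debug later."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent_id": agent_id,
        "tool_name": tool_name,
        "error_type": type(error).__name__,
        "error_message": str(error),
        "retry_count": retry_count,
        # e.g. "retry_succeeded", "fallback_used", "escalated"
        "resolution": resolution,
    }
    logger.error(json.dumps(record))
    return record
```

Keeping the record machine-parseable (one JSON object per line) is what makes the "when did this spike start?" question answerable later.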
The hierarchy
When designing your error strategy, think in layers:
1. Retry — the error is transient, try again
2. Fallback — the approach failed, try a different approach
3. Circuit break — the dependency is down, stop trying
4. Escalate — the system can't recover, involve a human
5. Fail loudly — nothing worked, return an error to the user with context
Every production agent should have all five layers. Most have only the first one. That's why most production agents feel unreliable.
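The five layers can be sketched as one ordered decision. The error classifications (`"transient"`, `"recoverable_by_human"`) are hypothetical labels your own error-mapping code would assign:

```python
def handle_error(error_class, retry_count, circuit_open, fallback_available,
                 max_retries=3):
    """Map an error to one of the five layers, checked in order."""
    if circuit_open:
        return "circuit_break"  # dependency is down: stop trying
    if error_class == "transient" and retry_count < max_retries:
        return "retry"          # transient: try again
    if fallback_available:
        return "fallback"       # approach failed: try a simpler one
    if error_class == "recoverable_by_human":
        return "escalate"       # system can't recover: involve a human
    return "fail_loudly"        # nothing worked: return an error with context
```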