Sequential Chaining: How to Build Multi-Step AI Agent Pipelines That Actually Work
The simplest orchestration pattern is also the most reliable. Here's how to build it right.
The pattern nobody explains properly
Every agent tutorial shows a single LLM call: prompt in, answer out. But production agent systems almost never work that way. They work in stages. A research agent gathers data, a second agent analyzes it, a third synthesizes the analysis, and a fourth formats the final output. Each stage receives the previous stage's output and nothing else.
That's sequential chaining. Output of step N becomes the input to step N+1. No branching. No parallelism. Just a pipeline.
It sounds trivially simple. It is not. The difference between a sequential chain that works in demos and one that works at scale comes down to how you handle context, errors, and the contracts between steps.
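Stripped to its core, the pattern is just a loop. As a generic sketch (the `Step` and `runChain` names here are illustrative, not from any particular framework):

```typescript
// Minimal sequential chain: each step maps a context to a new context,
// and steps run strictly in order.
type Step<C> = (ctx: C) => Promise<C>;

async function runChain<C>(steps: Step<C>[], ctx: C): Promise<C> {
  for (const step of steps) {
    ctx = await step(ctx); // output of step N becomes input to step N+1
  }
  return ctx;
}
```

Everything that follows — context handling, structured outputs, error strategies — is about making that loop survive contact with production.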
Why sequential chaining dominates production
Three reasons this pattern shows up everywhere:
Focused context. Each step in the chain sees only what it needs. A research agent doesn't need to know about formatting requirements. A formatting agent doesn't need raw search results. This constraint is a feature — it prevents context window bloat and keeps each LLM call precise.
Debuggability. When step 4 produces garbage, you inspect step 3's output. The bug is always at the boundary between two steps. Compare this to a monolithic prompt that tries to research, analyze, synthesize, and format in one pass — when that fails, you're staring at a 4,000-token output with no idea where the reasoning went wrong.
Composability. Steps are interchangeable. Swap out the research agent for a different one, and the rest of the chain doesn't change. Upgrade one step to a better model without touching the others. This modularity is what separates production systems from prototypes.
Real-world examples
Claude Code Review's 8-step pipeline. Anthropic's code review system processes pull requests through 8 sequential stages. The PR is parsed and diffed. The diff is chunked. Each chunk is analyzed for issues. Issues are scored for confidence (0-100; only issues scoring 80 or above are surfaced). Results are aggregated. Duplicates are removed. The final review is formatted and posted. Each stage feeds precisely into the next, and this pipeline took their substantive comment rate from 16% to 54%.
Codex's turn lifecycle. OpenAI's Codex agent runs a 10-step sequential process for every user request: parse the instruction, plan the approach, select tools, execute tool calls, verify results, handle errors, summarize findings, format the response, run guardrails, and deliver. Each step is a distinct function with defined inputs and outputs.
The pattern is the same in both cases: linear flow, clean handoffs, inspectable boundaries.
When to use it
Sequential chaining is the right choice when:
- Steps have true dependencies. You can't analyze data you haven't gathered yet. You can't format a report you haven't written. If step N+1 genuinely requires step N's output, this is your pattern.
- Debuggability matters more than speed. Sequential chains are inherently slower than parallel execution — you're adding latency at every step. But the ability to inspect each intermediate result is worth it for complex workflows.
- You're building incrementally. Start with a 2-step chain (research then format). Add steps as you learn. This beats designing a 10-step pipeline upfront and debugging it all at once.
Sequential chaining is the wrong choice when:
- Steps are independent. If step 3 and step 4 don't need each other's output, you're adding latency for no reason. Use [parallel fan-out](/guides/parallel-fan-out-ai-agents) instead.
- Latency is the primary constraint. Each step adds at least one LLM call. A 6-step chain at 2 seconds per call is 12 seconds minimum. If your user is waiting, that's too long.
- Failure of one step shouldn't block everything. In a sequential chain, step 3 failing means steps 4-8 never execute. If partial results are acceptable, you need a different pattern.
Implementation patterns
The context object
The key abstraction is a context object that flows through the pipeline. Each step reads from it and writes to it.
```typescript
interface PipelineContext {
  query: string;
  research?: ResearchResult[];
  analysis?: AnalysisResult;
  synthesis?: string;
  report?: string;
  errors: StepError[];
  metadata: { startedAt: number; stepTimings: Record<string, number> };
}

type PipelineStep = (ctx: PipelineContext) => Promise<PipelineContext>;
```

Each step is a function that takes the context, does its work, and returns the updated context. This gives you a clean contract: every step knows exactly what data it can read and what it's responsible for writing.
Structured outputs between steps
Don't pass free text between steps. Use structured outputs — JSON schemas that the LLM must conform to. This eliminates an entire class of bugs where step 3 can't parse step 2's output because the format changed.
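What that contract might look like at one boundary, as a sketch — a `ResearchResult` type plus a runtime guard (both names are hypothetical, not part of any particular SDK):

```typescript
// Illustrative contract for the research -> analysis boundary.
// The point is that each boundary validates shape instead of
// trusting free text.
interface ResearchResult {
  source: string;     // where the finding came from
  claim: string;      // the finding itself
  confidence: number; // 0-100, matching the scoring convention above
}

function isResearchResult(value: unknown): value is ResearchResult {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.source === "string" &&
    typeof v.claim === "string" &&
    typeof v.confidence === "number"
  );
}
```

When the LLM's output fails the guard, you know immediately which boundary broke, instead of discovering it two steps later.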
```typescript
const researchStep: PipelineStep = async (ctx) => {
  const start = Date.now();
  const result = await llm.generate({
    model: "claude-sonnet-4-20250514",
    system: "You are a research agent. Return structured findings.",
    prompt: `Research this topic: ${ctx.query}`,
    responseFormat: {
      type: "json_schema",
      schema: ResearchResultSchema,
    },
  });
  ctx.research = result.parsed;
  ctx.metadata.stepTimings["research"] = Date.now() - start;
  return ctx;
};
```

Error propagation
Every step can fail. The question is what happens next. Three strategies:
Fail-fast. Step fails, pipeline stops, error is reported. Best for pipelines where partial results are worthless.
Skip and continue. Step fails, its output is marked as unavailable, the next step adapts. Best for pipelines where partial results still have value.
Retry with modification. Step fails, the error is appended to the context, the step is retried with the error as additional input. Best for transient failures or malformed LLM outputs.
```typescript
async function runPipeline(
  steps: PipelineStep[],
  ctx: PipelineContext,
  options: { maxRetries: number; strategy: "fail-fast" | "skip" | "retry" }
): Promise<PipelineContext> {
  for (const step of steps) {
    let attempts = 0;
    while (attempts < options.maxRetries) {
      try {
        ctx = await step(ctx);
        break;
      } catch (error) {
        attempts++;
        ctx.errors.push({ step: step.name, error, attempt: attempts });
        if (options.strategy === "fail-fast") throw error;
        if (options.strategy === "skip") break;
        // "retry" loops again; once retries are exhausted, surface the error
        // rather than silently moving on to the next step
        if (attempts >= options.maxRetries) throw error;
      }
    }
  }
  return ctx;
}
```

The full pipeline
Putting it together — a research-to-report pipeline with four stages:
```typescript
const pipeline: PipelineStep[] = [
  researchStep,  // Gather sources and raw data
  analysisStep,  // Extract key findings, identify patterns
  synthesisStep, // Merge findings into a coherent narrative
  reportStep,    // Format into final deliverable
];

const result = await runPipeline(
  pipeline,
  {
    query: "Current state of AI agent frameworks in 2026",
    errors: [],
    metadata: { startedAt: Date.now(), stepTimings: {} },
  },
  { maxRetries: 3, strategy: "retry" }
);
```

Each step uses the model appropriate for its complexity. Research might use Sonnet for speed. Analysis might use Opus for reasoning depth. Formatting might use Haiku because it's cheap and the task is mechanical. This model routing across steps is where sequential chaining saves the most money.
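One way to sketch that routing is a plain step-to-model lookup. Only the Sonnet ID is taken from the research step above; the Opus and Haiku IDs here are placeholders you would replace with your provider's current model names:

```typescript
// Per-step model routing as a lookup table. Swapping a step's model
// becomes a one-line change, with a fallback for unlisted steps.
const MODEL_BY_STEP: Record<string, string> = {
  research: "claude-sonnet-4-20250514",  // fast gathering
  analysis: "claude-opus-4-20250514",    // deeper reasoning (placeholder ID)
  synthesis: "claude-sonnet-4-20250514",
  report: "claude-3-5-haiku-20241022",   // cheap, mechanical formatting (placeholder ID)
};

function modelFor(step: string, fallback = "claude-sonnet-4-20250514"): string {
  return MODEL_BY_STEP[step] ?? fallback;
}
```

Each step factory then reads its model from the table, so cost tuning never touches step logic.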
What to build next
Sequential chaining is the foundation. Once you've built a reliable pipeline, the natural next step is adding [parallel fan-out](/guides/parallel-fan-out-ai-agents) for steps that don't depend on each other — running research across multiple sources simultaneously, then merging results back into the sequential flow.
For pipelines that handle sensitive decisions, add [human-in-the-loop](/guides/human-in-the-loop-ai-agents) gates between steps. The pipeline pauses at a checkpoint, a human reviews the intermediate result, and the pipeline resumes on approval.
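A gate like that can be just another pipeline step. As a sketch, with `requestApproval` standing in for whatever review channel you use (a Slack message, a review UI — both the hook and the `GateCtx` shape are hypothetical):

```typescript
// A human approval gate expressed as an ordinary pipeline step.
// Injecting the approval hook keeps the gate itself easy to test.
type GateCtx = { synthesis?: string; approved?: boolean };

function approvalGate(
  requestApproval: (summary: string) => Promise<boolean>
): (ctx: GateCtx) => Promise<GateCtx> {
  return async (ctx) => {
    const ok = await requestApproval(ctx.synthesis ?? "(no intermediate result)");
    if (!ok) throw new Error("pipeline halted: reviewer rejected intermediate result");
    return { ...ctx, approved: true };
  };
}
```

Because the gate throws on rejection, the same error-propagation strategies described above decide what happens next.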
Start sequential. Add complexity only when you have evidence it's needed.