# Parallel Fan-Out: Running Multiple AI Agents Simultaneously to Cut Latency by 10x
Dispatch N independent tasks at once, collect all results, merge them. Here's the engineering.
## The latency problem
A sequential pipeline with 5 steps at 3 seconds each takes 15 seconds. If those steps are independent — if step 3 doesn't need step 2's output — you're waiting 12 seconds for no reason.
Parallel fan-out solves this. Dispatch all 5 tasks simultaneously, collect results when they're all done. Total time: 3 seconds (the slowest task), not 15. That's a 5x improvement from changing the execution pattern, not the model or the prompt.
In practice, the gains are even larger. Anthropic's multi-agent research found that multi-agent architectures outperform single-agent approaches by 90.2% — and parallelism is a primary driver. The same research showed that token usage explains 80% of performance variance. Parallel agents spend their tokens on actual work, not waiting in line.
## The independence requirement
Not everything can be parallelized. The rule is simple: if task B needs task A's output, they must be sequential. If they don't, they can be parallel.
Safe to parallelize:
- Reviewing different files in a code review (each file is independent)
- Searching multiple data sources for the same query
- Running different analysis perspectives on the same document
- Generating alternative drafts of the same content
Not safe to parallelize:
- Gathering data, then analyzing it (analysis needs the data)
- Writing a draft, then editing it (editing needs the draft)
- Making a decision, then executing it (execution needs the decision)
The mistake most builders make is assuming dependency where there is none. "I need to research before I analyze" is true when there's one data source. When there are five data sources, the five research tasks can run in parallel, and the analysis step waits for all of them.
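The shape of that example is worth pinning down in code. Here's a minimal sketch of the research-then-analyze pattern: the five source fetches are independent and run concurrently, while the single analysis step genuinely depends on all of them. `fetchSource` and `analyze` are hypothetical stand-ins, not a real API.

```typescript
// Hypothetical stand-in for a call against one data source.
async function fetchSource(name: string): Promise<string> {
  return `results from ${name}`;
}

// Hypothetical stand-in for the dependent analysis step.
async function analyze(findings: string[]): Promise<string> {
  return `analysis of ${findings.length} sources`;
}

async function researchThenAnalyze(sources: string[]): Promise<string> {
  // Independent: all research tasks are dispatched at once.
  const findings = await Promise.all(sources.map(fetchSource));
  // Dependent: analysis runs once, after every research task completes.
  return analyze(findings);
}
```

The dependency boundary is the single `await Promise.all` — everything before it fans out, everything after it is sequential.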
## Real-world fan-out implementations
Claude Code Review: 5 parallel Sonnet agents. When reviewing a pull request, Claude's code review system fans out to 5 parallel Sonnet agents, each with a specialized lens — CLAUDE.md compliance, shallow bugs, git history context, previous PR patterns, and code comment compliance. Each agent applies its analysis independently. The results are collected, deduplicated, and scored. This parallel design means the review is as fast as the slowest single-lens pass.
The system scores each candidate comment from 0 to 100 and surfaces only those at 80 or above. This filtering at the reducer stage is critical — without it, five agents produce five times the noise. The result: a false positive rate under 1%.
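A threshold filter like that is a one-liner in the reducer. This is an illustrative sketch, not Claude's actual code — the `Finding` shape and `surface` name are assumptions:

```typescript
// Hypothetical finding shape: a candidate review comment with a 0-100 score.
interface Finding {
  comment: string;
  confidence: number;
}

// Keep only findings at or above the confidence threshold.
function surface(findings: Finding[], threshold = 80): Finding[] {
  return findings.filter((f) => f.confidence >= threshold);
}
```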
Ellipsis: dozens of specialized agents in parallel. Ellipsis takes a more granular approach, dispatching dozens of specialized agents — one for documentation checks, one for test coverage analysis, one for dependency auditing, and so on. Each agent uses the model best suited to its task (mixing GPT-4o and Claude Sonnet), and results are merged in a central reducer. The parallel execution means adding a new specialized agent doesn't increase total latency.
Codex: FuturesOrdered parallel tool execution. OpenAI's Codex uses a FuturesOrdered pattern — tool calls that don't depend on each other are dispatched simultaneously, but results are returned in submission order. This preserves deterministic output while capturing the latency benefits of parallelism.
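In JavaScript, `Promise.all` gives you the same guarantee as Rust's `FuturesOrdered` out of the box: calls run concurrently, but the result array matches dispatch order regardless of completion order. A small demonstration (the `delay` helper simulates calls with different latencies; this is an analog, not Codex's actual Rust code):

```typescript
// Resolve with `value` after `ms` milliseconds.
function delay<T>(ms: number, value: T): Promise<T> {
  return new Promise((resolve) => setTimeout(() => resolve(value), ms));
}

async function orderedFanOut(): Promise<string[]> {
  // The second call finishes first, but the result array still
  // matches the order the calls were dispatched in.
  return Promise.all([
    delay(30, "first-dispatched"),
    delay(10, "second-dispatched"),
    delay(20, "third-dispatched"),
  ]);
}
```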
## The reducer: where fan-out gets hard
Dispatching tasks is easy. Merging results is where the engineering lives.
When five agents analyze the same code, they'll find overlapping issues. Agent A and Agent C might both flag the same SQL injection risk. Agent B might find a performance issue that contradicts Agent D's suggestion. The reducer has to handle all of this.
Deduplication. Identify semantically identical findings, even when phrased differently. "This SQL query is vulnerable to injection" and "User input is concatenated into the query without sanitization" are the same finding. Deduplication can be a simple LLM call that compares pairs, or a more sophisticated embedding-based similarity check.
Ranking. Not all findings are equal. The reducer should assign priority based on severity, confidence, and agreement. A finding flagged by 4 out of 5 agents is more likely real than one flagged by 1 out of 5.
Conflict resolution. When agents disagree, the reducer needs a strategy. Options: majority vote, defer to the highest-confidence agent, escalate to a stronger model, or flag the disagreement for human review.
```typescript
interface FanOutResult<T> {
  agentId: string;
  result: T;
  confidence: number;
  duration: number;
  error?: Error;
}

function reduceResults<T>(
  results: FanOutResult<T>[],
  dedup: (a: T, b: T) => boolean,
  rank: (item: T, votes: number, confidence: number) => number
): T[] {
  // Filter out errors
  const successful = results.filter((r) => !r.error);

  // Deduplicate: merge semantically identical findings, counting votes
  // and keeping the highest confidence seen for each
  const unique: { item: T; votes: number; maxConfidence: number }[] = [];
  for (const r of successful) {
    const existing = unique.find((u) => dedup(u.item, r.result));
    if (existing) {
      existing.votes++;
      existing.maxConfidence = Math.max(existing.maxConfidence, r.confidence);
    } else {
      unique.push({ item: r.result, votes: 1, maxConfidence: r.confidence });
    }
  }

  // Rank by agreement and confidence, highest-priority first
  return unique
    .sort(
      (a, b) =>
        rank(b.item, b.votes, b.maxConfidence) -
        rank(a.item, a.votes, a.maxConfidence)
    )
    .map((u) => u.item);
}
```

## Error handling: fail-fast vs best-effort
When one of five parallel agents fails, you have two choices:
Fail-fast (`Promise.all`). If any agent fails, the entire fan-out fails. Use this when every agent's result is required — you can't produce a complete code review without the security analysis.
Best-effort (`Promise.allSettled`). Collect whatever succeeds, ignore failures. Use this when partial results still have value — a report based on 4 out of 5 data sources is better than no report.
```typescript
// Fail-fast: all or nothing
const allResults = await Promise.all(
  agents.map((agent) => agent.analyze(chunk))
);

// Best-effort: take what you can get
const settled = await Promise.allSettled(
  agents.map((agent) => agent.analyze(chunk))
);

const succeeded = settled
  .filter((r): r is PromiseFulfilledResult<AnalysisResult> =>
    r.status === "fulfilled"
  )
  .map((r) => r.value);

const failed = settled.filter((r) => r.status === "rejected");
if (failed.length > 0) {
  console.warn(`${failed.length}/${agents.length} agents failed`);
}
```

Most production systems use a hybrid: best-effort with a minimum threshold. If 4 out of 5 agents succeed, proceed. If 2 out of 5 succeed, fail. The threshold depends on your quality requirements.
## Latency comparison
Here's the math for a code review pipeline analyzing 5 files, where each analysis takes 2-4 seconds:
| Approach | Total latency | Agent calls |
|----------|--------------|-------------|
| Sequential | 15s (sum of all) | 5 serial |
| Parallel fan-out | 4s (max of all) | 5 concurrent |
| Parallel + reduce | 5s (max + reduce) | 5 concurrent + 1 |
The reducer adds roughly 1 second. You're still 3x faster than sequential, and for larger fan-outs the ratio improves. At 20 parallel agents, sequential takes 60 seconds. Parallel takes 4.
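The arithmetic behind the table is just sum versus max. A back-of-the-envelope check, using assumed per-task durations in the 2-4 second range:

```typescript
// Five analyses with assumed durations (seconds), plus one reduce step.
const taskDurations = [2, 3, 4, 2, 4];
const reduceDuration = 1;

// Sequential: total latency is the sum of all task durations.
const sequential = taskDurations.reduce((a, b) => a + b, 0); // 15

// Parallel: total latency is the duration of the slowest task.
const parallel = Math.max(...taskDurations); // 4

// Parallel + reduce: slowest task plus the reducer pass.
const parallelPlusReduce = parallel + reduceDuration; // 5
```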
## Implementation with concurrency limits
Unbounded parallelism is dangerous. If you dispatch 50 agents simultaneously, you'll hit API rate limits, overwhelm your infrastructure, and produce worse results (models under load can return lower-quality outputs).
```typescript
async function fanOutWithLimit<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  concurrency: number
): Promise<FanOutResult<R>[]> {
  const results: FanOutResult<R>[] = [];
  const executing = new Set<Promise<void>>();

  for (const [i, item] of items.entries()) {
    const p = (async () => {
      const start = Date.now();
      try {
        const result = await worker(item);
        results.push({
          agentId: `agent-${i}`,
          result,
          confidence: 1,
          duration: Date.now() - start,
        });
      } catch (error) {
        results.push({
          agentId: `agent-${i}`,
          result: undefined as unknown as R,
          confidence: 0,
          duration: Date.now() - start,
          error: error as Error,
        });
      }
    })();

    executing.add(p);
    p.finally(() => executing.delete(p));

    // Once the pool is full, wait for any in-flight task to finish
    if (executing.size >= concurrency) {
      await Promise.race(executing);
    }
  }

  // Drain the remaining in-flight tasks
  await Promise.all(executing);
  return results;
}
```

Set concurrency to match your rate limits. For the Claude API, 5-10 concurrent requests is a safe starting point for most tiers.
## Combining with sequential chaining
Fan-out rarely exists alone. The most common production pattern is a sequential pipeline with fan-out stages:
1. Parse (sequential) — break the input into chunks
2. Analyze (fan-out) — analyze each chunk in parallel
3. Reduce (sequential) — merge and deduplicate results
4. Format (sequential) — produce the final output
This hybrid gives you the debuggability of [sequential chaining](/guides/sequential-chaining-ai-agents) where you need it and the speed of parallelism where the tasks allow it.
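The four stages above compose into a short pipeline. This is a skeletal sketch — the chunking, analysis, and dedup logic are placeholder stand-ins for real model calls:

```typescript
async function pipeline(input: string): Promise<string> {
  // 1. Parse (sequential): break the input into chunks.
  const chunks = input.split("\n\n");

  // 2. Analyze (fan-out): analyze each chunk in parallel.
  //    Placeholder for a real per-chunk agent call.
  const analyses = await Promise.all(
    chunks.map(async (chunk) => `finding: ${chunk.slice(0, 40)}`)
  );

  // 3. Reduce (sequential): merge and deduplicate results.
  const unique = [...new Set(analyses)];

  // 4. Format (sequential): produce the final output.
  return unique.join("\n");
}
```

Only stage 2 fans out; the stages around it stay sequential because each one depends on the previous stage's output.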
For production systems that need coordination across many parallel agents, consider the [supervisor-worker pattern](/guides/supervisor-worker-ai-orchestration) — it adds a planning layer that decides what to parallelize and how to merge results.