Supervisor-Worker: The Orchestration Pattern That Scales AI Agent Teams
One agent plans and delegates. Many agents execute. Here's how to build it without losing control.
Why single agents stop scaling
A single agent handling a complex task hits three walls simultaneously: the context window fills up, the instructions become contradictory, and debugging becomes impossible. When one agent is responsible for research, analysis, code generation, testing, and documentation, it does all of them poorly.
The supervisor-worker pattern solves this by splitting responsibilities. One agent — the supervisor — handles planning, delegation, and synthesis. Multiple agents — the workers — each handle one specific task. The supervisor never writes code. The workers never plan. This separation is what makes the pattern scale.
Anthropic's research supports the approach: in its internal evaluation, a multi-agent system (a lead agent orchestrating parallel subagents) outperformed a single-agent baseline by 90.2%. The gain comes from specialization. A worker agent with a focused system prompt, limited tools, and a narrow objective consistently outperforms a general-purpose agent given the same subtask.
Anatomy of the pattern
The supervisor
The supervisor's job is three things:
Task decomposition. Given a high-level objective ("Review this pull request for security, performance, and style issues"), the supervisor breaks it into discrete tasks: "Check for SQL injection vulnerabilities," "Identify N+1 query patterns," "Verify naming conventions match the style guide." Each task is a self-contained work unit.
Worker selection. Not every task goes to every worker. The supervisor matches tasks to workers based on their specialization. A security worker gets security tasks. A style worker gets style tasks. This matching can be static (predefined routing rules) or dynamic (the supervisor reads worker descriptions and decides at runtime).
Result synthesis. When workers return their results, the supervisor merges them into a coherent output. This means deduplicating, resolving conflicts, prioritizing findings, and formatting the final deliverable. Synthesis is where the supervisor's intelligence matters most — it's the hardest step to get right.
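The worker-selection step can be sketched as a static routing table. This is a minimal illustration; the type names and worker identifiers here are invented, not from any particular framework:

```typescript
// Static routing: each task type maps to exactly one specialized worker.
// (A dynamic supervisor would instead read worker descriptions at runtime.)
type TaskType = "security" | "performance" | "style";
type WorkerId = "security_worker" | "performance_worker" | "style_worker";

const staticRoutes: Record<TaskType, WorkerId> = {
  security: "security_worker",
  performance: "performance_worker",
  style: "style_worker",
};

function routeTask(type: TaskType): WorkerId {
  return staticRoutes[type];
}
```

Static routing is cheap and predictable; the dynamic variant trades an extra LLM call for flexibility when the worker pool changes often.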
The workers
Workers are deliberately limited. Each worker has:
Bounded context. A worker sees only the system prompt for its specialization and the specific task assigned to it. It doesn't see other workers' tasks or results. This constraint prevents context pollution and keeps each worker focused.
Defined contracts. The supervisor sends a structured task object. The worker returns a structured result object. The format is fixed. This means you can swap workers, add new workers, or upgrade a worker's model without changing the supervisor or other workers.
Statelessness. Workers don't maintain state between invocations. Each task is a fresh call. This makes workers easy to parallelize, retry, and scale. If a worker fails, you restart it — you don't need to reconstruct its state.
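Under these three constraints, a worker reduces to a pure async function. A hedged sketch, where `callModel` is a deterministic stub standing in for the actual LLM call:

```typescript
// Illustrative sketch of a stateless worker. Each invocation receives only its
// task and the worker's own system prompt; nothing persists between calls, so a
// retry is just a fresh call with the same inputs.
interface Task { taskId: string; payload: string }
interface Result { taskId: string; findings: string[] }

// Stand-in for an LLM call (hypothetical; stubbed so the sketch is runnable).
async function callModel(systemPrompt: string, input: string): Promise<string[]> {
  return [`reviewed: ${input.slice(0, 20)}`];
}

async function runWorker(systemPrompt: string, task: Task): Promise<Result> {
  // Bounded context: the worker sees its prompt and its task, nothing else.
  const findings = await callModel(systemPrompt, task.payload);
  // Defined contract: always return the same structured result shape.
  return { taskId: task.taskId, findings };
}
```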
Production example: Claude Code Review
Claude's code review system is a textbook supervisor-worker implementation. The orchestrator (supervisor) receives a pull request. It parses the diff, identifies which files changed, and decomposes the review into tasks. It then dispatches these to 5 parallel Sonnet agents (workers), each with a specialized lens — CLAUDE.md compliance, shallow bugs, git history context, previous PR patterns, and code comment compliance.
Each worker reviews its assigned scope and returns findings with confidence scores (0-100). The orchestrator collects all findings, deduplicates across workers, filters by confidence threshold (80 or above), and assembles the final review.
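The collect, deduplicate, and threshold step can be sketched as follows. The `Finding` shape and the dedupe key (file, line, message) are assumptions for illustration, not the actual internal format:

```typescript
// Merge findings from parallel workers: drop low-confidence results, collapse
// duplicates reported by multiple workers, keep the highest-confidence copy.
interface Finding { file: string; line: number; message: string; confidence: number }

function mergeFindings(all: Finding[], threshold = 80): Finding[] {
  const seen = new Map<string, Finding>();
  for (const f of all) {
    if (f.confidence < threshold) continue; // filter by confidence threshold
    const key = `${f.file}:${f.line}:${f.message}`;
    const prev = seen.get(key);
    if (!prev || f.confidence > prev.confidence) seen.set(key, f);
  }
  // Highest-confidence findings first in the assembled review.
  return [...seen.values()].sort((a, b) => b.confidence - a.confidence);
}
```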
The system uses a three-tier model strategy:
- Haiku for classification and routing (cheap, fast)
- Sonnet for the actual review work (balanced cost/quality)
- Opus for complex reasoning when a finding needs deeper analysis
This model mixing is a direct benefit of the supervisor-worker pattern. A single-agent system uses one model for everything. A supervisor-worker system uses the right model for each task.
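Tiered routing reduces to a small lookup. A sketch, with an invented `Stage` type standing in for whatever task classification the real system uses:

```typescript
// Map each stage of the pipeline to the cheapest model that can handle it.
type Model = "haiku" | "sonnet" | "opus";
type Stage = "classify" | "review" | "deep_reasoning";

function pickModel(stage: Stage): Model {
  switch (stage) {
    case "classify": return "haiku";       // cheap, fast routing decisions
    case "review": return "sonnet";        // balanced cost/quality for bulk work
    case "deep_reasoning": return "opus";  // reserved for findings that need it
  }
}
```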
Communication protocols
The contract between supervisor and worker is the most important design decision. Get this wrong and the system falls apart.
Task schema
```typescript
interface WorkerTask {
  taskId: string;
  type: "security" | "performance" | "style" | "logic";
  payload: {
    code: string;
    filePath: string;
    language: string;
    context?: string;
  };
  constraints: {
    maxTokens: number;
    timeoutMs: number;
    model: "haiku" | "sonnet" | "opus";
  };
}
```

Result schema
```typescript
interface WorkerResult {
  taskId: string;
  findings: {
    severity: "critical" | "warning" | "info";
    line: number;
    message: string;
    suggestion?: string;
    confidence: number; // 0-100
  }[];
  metadata: {
    tokensUsed: number;
    durationMs: number;
    modelUsed: string;
  };
}
```

Error escalation
Workers should never silently fail. When a worker encounters something it can't handle, it returns an explicit escalation:
```typescript
interface WorkerEscalation {
  taskId: string;
  reason: "out_of_scope" | "ambiguous_input" | "confidence_too_low" | "timeout";
  partialResult?: WorkerResult;
  message: string;
}
```

The supervisor then decides: retry with a better model, reassign to a different worker, break the task into smaller pieces, or escalate to a human.
Building the supervisor
The supervisor itself is an LLM-powered agent. Its system prompt should be explicit about its role:
```typescript
const supervisorPrompt = `You are a code review supervisor. Your job is to:
1. Decompose the pull request into reviewable chunks
2. Assign each chunk to the appropriate worker
3. Synthesize worker results into a final review

You do NOT review code yourself. You delegate to specialized workers.

Available workers:
- security_worker: SQL injection, XSS, auth bypass, secrets detection
- performance_worker: N+1 queries, memory leaks, algorithmic complexity
- style_worker: naming conventions, code organization, documentation

For each chunk, select the most relevant worker(s). Return your plan as structured JSON.`;
```
The supervisor calls an LLM to produce the task decomposition, then dispatches tasks to workers programmatically. This is important — the supervisor uses LLM intelligence for planning but deterministic code for execution.
```typescript
async function supervisorRun(pr: PullRequest): Promise<ReviewResult> {
  // Step 1: Plan (LLM)
  const plan = await llm.generate({
    model: "claude-sonnet-4-20250514",
    system: supervisorPrompt,
    prompt: `Decompose this PR into review tasks:\n${pr.diff}`,
    responseFormat: { type: "json_schema", schema: ReviewPlanSchema },
  });

  // Step 2: Dispatch (deterministic, parallel)
  const tasks = plan.tasks.map((t) => dispatchToWorker(t));
  const results = await Promise.allSettled(tasks);

  // Step 3: Handle escalations
  const escalations = results.filter(
    (r) => r.status === "fulfilled" && r.value.type === "escalation",
  );
  for (const esc of escalations) {
    await handleEscalation(esc.value);
  }

  // Step 4: Synthesize (LLM)
  const findings = results
    .filter((r) => r.status === "fulfilled" && r.value.type === "result")
    .flatMap((r) => r.value.findings);

  const synthesis = await llm.generate({
    model: "claude-sonnet-4-20250514",
    system: "Deduplicate and rank these code review findings.",
    prompt: JSON.stringify(findings),
  });

  return synthesis;
}
```

CrewAI's role-based approach
CrewAI implements the supervisor-worker pattern through its "crew" abstraction. You define agents with roles (Researcher, Writer, Editor), assign them tools and backstories, and compose them into a crew with a defined process (sequential or hierarchical).
In hierarchical mode, a "manager" agent acts as the supervisor — decomposing the crew's overall task and delegating to role-specific agents. CrewAI's reported 12 million daily agent executions suggest the pattern holds up at scale, though the role-playing prompt pattern adds token overhead compared to a minimal supervisor implementation.
The tradeoff: CrewAI gets you from zero to working supervisor-worker in an afternoon. A custom implementation takes longer but gives you full control over task schemas, model routing, and error handling.
Common failure modes
Supervisor doing worker work. The most common mistake. If the supervisor starts "helping" by doing analysis alongside delegation, it becomes a bottleneck and its context window fills with details that belong in workers. Enforce the boundary: the supervisor plans and synthesizes. It never executes.
Overly broad workers. A "general code review" worker is just a single agent with a different name. Workers should be narrow. A security worker that also checks style is doing two jobs. Split it.
Missing escalation paths. When a worker doesn't know what to do, it should say so. Without escalation, workers produce low-confidence results that the supervisor treats as authoritative. Always give workers an "I don't know" option.
Synchronous bottleneck. If the supervisor dispatches workers one at a time and waits for each, you've built an expensive sequential chain. Dispatch all independent workers in parallel. The supervisor should be waiting for the slowest worker, not the sum of all workers.
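Parallel dispatch is nearly a one-liner with `Promise.allSettled`, which also tolerates individual worker failures without losing the other results:

```typescript
// Start every independent worker immediately, then await them together.
// Total latency approaches the slowest worker, not the sum of all workers.
async function dispatchAll<T>(
  tasks: (() => Promise<T>)[],
): Promise<PromiseSettledResult<T>[]> {
  return Promise.allSettled(tasks.map((run) => run()));
}
```

Using `allSettled` rather than `all` matters here: one rejected worker should surface as an escalation, not abort the entire review.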
Scaling up
Once you have a working supervisor-worker system, the natural extension is [hierarchical multi-agent systems](/guides/hierarchical-multi-agent-systems) — supervisors that delegate to other supervisors. Codex uses this pattern with its recursive sub-agents (Default, Explorer, Worker), where each sub-agent can itself spawn further sub-agents with depth limits.
For quality assurance, add [reflection and critique loops](/guides/reflection-critique-loops-ai-agents) — a critic worker that reviews the supervisor's synthesis before it's delivered. This catches errors that individual workers miss because they only see their slice.
The supervisor-worker pattern is the workhorse of production multi-agent systems. Get the contracts right, keep workers narrow, and let the supervisor focus on planning. Everything else follows.