
Hierarchical Multi-Agent: How to Build AI Organizations That Self-Coordinate

Flat agent graphs collapse under complexity. Hierarchies scale the same way organizations do — delegation down, results up.

The Flat Graph Ceiling

A single supervisor managing five workers is clean. A single supervisor managing twenty workers is chaos. The supervisor's context window fills with status updates. Its routing decisions degrade. It can't tell the difference between a worker that's stuck and a worker that's doing something complex.

This is the same scaling problem human organizations solved centuries ago: you add management layers. A CEO doesn't assign tasks to individual engineers. They delegate to VPs, who delegate to directors, who delegate to leads, who assign the actual work.

Hierarchical multi-agent systems apply this organizational pattern to AI, and the gains can be large: Anthropic reported that its multi-agent research system (a Claude Opus 4 lead agent delegating to Claude Sonnet 4 subagents) outperformed a single-agent Claude Opus 4 setup by 90.2% on its internal research evaluation. The key insight is that the performance gain comes not from adding more agents, but from structuring how they coordinate.

The flat graph fails for three specific reasons. First, the supervisor becomes a bottleneck — every status update, every failure report, every result passes through a single node whose context window has finite capacity. Second, routing quality degrades with fan-out — a supervisor choosing between 3 workers makes good decisions, but choosing between 15 makes mediocre ones because the differences between workers blur in a crowded context. Third, failure isolation is impossible — when everything reports to one node, a noisy failure in worker 7 contaminates the supervisor's reasoning about workers 8 through 15.

Tree vs DAG: Choosing Your Topology

The simplest hierarchy is a tree. Each agent has exactly one parent and zero or more children. Tasks flow down, results flow up. No agent reports to two managers.

         Orchestrator
        /      |      \
   Research  Analysis  Writing
   /    \       |
 Web   DB    Stats

Trees are easy to reason about. When something fails, you know exactly which branch it's in. The downside: no sharing. If both the Research and Analysis branches need the same data, they each fetch it independently.
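Because each agent has exactly one parent, finding the failing branch is just a walk up parent pointers. A minimal sketch, assuming a hypothetical `AgentNode` record that stores only its parent's id; the node names mirror the diagram above.

```typescript
// Hypothetical tree representation: each agent stores only its single parent's id.
interface AgentNode {
  id: string;
  parent: string | null; // null for the root orchestrator
}

const tree: AgentNode[] = [
  { id: 'Orchestrator', parent: null },
  { id: 'Research', parent: 'Orchestrator' },
  { id: 'Analysis', parent: 'Orchestrator' },
  { id: 'Writing', parent: 'Orchestrator' },
  { id: 'Web', parent: 'Research' },
  { id: 'DB', parent: 'Research' },
  { id: 'Stats', parent: 'Analysis' },
];

// Walk parent pointers from a failed agent up to the root.
// With one parent per node the path is unique, so the failing branch is unambiguous.
function failurePath(nodes: AgentNode[], failedId: string): string[] {
  const byId = new Map(nodes.map((n) => [n.id, n] as const));
  const path: string[] = [];
  let current = byId.get(failedId);
  while (current) {
    path.unshift(current.id);
    current = current.parent === null ? undefined : byId.get(current.parent);
  }
  return path;
}

console.log(failurePath(tree, 'DB').join(' > ')); // Orchestrator > Research > DB
```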

A DAG (directed acyclic graph) allows shared dependencies. The Stats agent can feed results to both Analysis and Writing. This is more efficient but harder to manage — you need dependency resolution, and a failure in a shared node can cascade to multiple branches.
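Dependency resolution usually means running agents in topological order, and a failed shared node has to mark everything downstream of it. A minimal sketch using Kahn's algorithm, assuming a hypothetical data-flow edge list in which `Stats` feeds both `Analysis` and `Writing`:

```typescript
// Hypothetical edge list: [from, to] means `to` consumes `from`'s output.
type Edge = [from: string, to: string];

const edges: Edge[] = [
  ['Web', 'Research'],
  ['DB', 'Research'],
  ['Stats', 'Analysis'],
  ['Stats', 'Writing'], // shared dependency: Stats feeds two branches
  ['Research', 'Writing'],
  ['Analysis', 'Writing'],
];

// Kahn's algorithm: repeatedly schedule nodes whose inputs are all available.
function executionOrder(edges: Edge[]): string[] {
  const indegree = new Map<string, number>();
  const consumers = new Map<string, string[]>();
  for (const [from, to] of edges) {
    indegree.set(from, indegree.get(from) ?? 0);
    indegree.set(to, (indegree.get(to) ?? 0) + 1);
    consumers.set(from, [...(consumers.get(from) ?? []), to]);
  }
  const ready = [...indegree].filter(([, d]) => d === 0).map(([n]) => n);
  const order: string[] = [];
  while (ready.length > 0) {
    const node = ready.shift()!;
    order.push(node);
    for (const next of consumers.get(node) ?? []) {
      const d = indegree.get(next)! - 1;
      indegree.set(next, d);
      if (d === 0) ready.push(next);
    }
  }
  if (order.length !== indegree.size) throw new Error('cycle detected: not a DAG');
  return order;
}

// A failure in a shared node cascades to every transitive consumer.
function affectedBy(edges: Edge[], failed: string): Set<string> {
  const affected = new Set<string>();
  const stack = [failed];
  while (stack.length > 0) {
    const node = stack.pop()!;
    for (const [from, to] of edges) {
      if (from === node && !affected.has(to)) {
        affected.add(to);
        stack.push(to);
      }
    }
  }
  return affected;
}

console.log(executionOrder(edges).join(', ')); // Web, DB, Stats, Research, Analysis, Writing
console.log([...affectedBy(edges, 'Stats')].join(', ')); // Analysis, Writing
```

The second call is the cascade problem in miniature: one failed shared node takes out two branches.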

When to use a tree: Your subtasks are independent. A report with separate chapters, a codebase with isolated modules, an investigation with distinct evidence streams.

When to use a DAG: Your subtasks share intermediate results. A research pipeline where data collection feeds both analysis and visualization. A code review where the same AST parse serves both style checking and security scanning.

For most production systems, start with a tree. Refactor to a DAG only when you have measurable duplication.
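One way to make "measurable duplication" concrete is to log each branch's tool calls and count the calls that more than one branch issued with identical arguments. A minimal sketch, assuming a hypothetical `ToolCallLog` record collected by your tracing layer:

```typescript
// Hypothetical trace record: one entry per tool call made by an agent.
interface ToolCallLog {
  branch: string; // top-level branch the calling agent belongs to
  tool: string;
  args: Record<string, unknown>;
}

// Returns call keys that were issued by two or more distinct branches.
function duplicatedCalls(logs: ToolCallLog[]): Map<string, Set<string>> {
  const branchesByCall = new Map<string, Set<string>>();
  for (const log of logs) {
    // Naive key: assumes args are serialized with a stable key order.
    const key = `${log.tool}:${JSON.stringify(log.args)}`;
    const branches = branchesByCall.get(key) ?? new Set<string>();
    branches.add(log.branch);
    branchesByCall.set(key, branches);
  }
  return new Map([...branchesByCall].filter(([, b]) => b.size > 1));
}

const logs: ToolCallLog[] = [
  { branch: 'Research', tool: 'sql', args: { table: 'q3_sales' } },
  { branch: 'Analysis', tool: 'sql', args: { table: 'q3_sales' } },
  { branch: 'Writing', tool: 'search', args: { q: 'style guide' } },
];

for (const [call, branches] of duplicatedCalls(logs)) {
  console.log(call, [...branches].join(', ')); // sql:{"table":"q3_sales"} Research, Analysis
}
```

If this map stays empty across real runs, the tree is not paying a duplication cost and the DAG refactor is not yet justified.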

Delegation Protocols

Delegation in a hierarchy has three phases: goal passing, context compression, and result roll-up. Get any of these wrong and the hierarchy collapses into noise.

Goal Passing

The parent agent decomposes its goal into sub-goals and assigns them to children. The critical design decision: how much autonomy does the child get?

Tight delegation specifies exact actions: "Fetch the Q3 revenue data from the finance API, filter to North America, return as a JSON array." The child is a specialized executor.

Loose delegation specifies outcomes: "Get the data we need to assess North American Q3 performance." The child decides how to accomplish it — which APIs to call, what filters to apply, whether to cross-reference sources.

Tight delegation gives you predictability but limits the child's ability to handle edge cases. Loose delegation gives you flexibility but requires smarter (more expensive) child agents. MetaGPT's hierarchy illustrates this well — their PM agent provides loose delegation to the Architect agent ("design a system for X"), but the Architect provides tight delegation to the Engineer agent ("implement this interface with these method signatures").
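The two styles can be encoded as a discriminated union so the parent states explicitly which contract it is offering. A minimal sketch; the `TightTask` and `LooseTask` shapes and the prompt wording are illustrative assumptions, not MetaGPT's actual message format:

```typescript
// Tight delegation: the parent dictates the steps; the child is an executor.
interface TightTask {
  kind: 'tight';
  steps: string[];
  outputFormat: string;
}

// Loose delegation: the parent states the outcome; the child plans its own steps.
interface LooseTask {
  kind: 'loose';
  outcome: string;
  allowedTools: string[]; // bounds the child's autonomy without scripting it
}

type Delegation = TightTask | LooseTask;

function renderInstructions(task: Delegation): string {
  switch (task.kind) {
    case 'tight':
      return [
        'Execute exactly these steps, in order, and do not improvise:',
        ...task.steps.map((s, i) => `${i + 1}. ${s}`),
        `Return: ${task.outputFormat}`,
      ].join('\n');
    case 'loose':
      return [
        `Goal: ${task.outcome}`,
        `Choose your own approach using: ${task.allowedTools.join(', ')}.`,
      ].join('\n');
  }
}

const tight: Delegation = {
  kind: 'tight',
  steps: ['Fetch Q3 revenue from the finance API', 'Filter to North America'],
  outputFormat: 'JSON array',
};
console.log(renderInstructions(tight));
```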

Context Compression

A parent holds context that is irrelevant to a child. Sending all of it wastes tokens and distracts the child with information it cannot use. In Anthropic's analysis of its multi-agent research system, token usage alone explained about 80% of the performance variance on the BrowseComp evaluation, so efficient context management is not a late optimization; it is part of the architecture.

```typescript
import { z } from 'zod';

interface DelegatedTask {
  goal: string;
  relevantContext: string;   // compressed, not the full parent context
  constraints: string[];
  outputSchema: z.ZodType;   // what the parent expects back
  budget: {
    maxTokens: number;
    maxToolCalls: number;
    timeoutMs: number;
  };
}
```

The parent should strip its context to only what the child needs. If the parent is orchestrating a market report and delegates "analyze competitor pricing" to a child, the child doesn't need the parent's full conversation history — it needs the list of competitors and a pricing data schema.
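A simple way to enforce this is to build the child's context from an explicit allow-list of keys instead of forwarding the parent's transcript. A minimal sketch, assuming the parent keeps its working state in a key-value record; `ParentState`, `CompressedTask`, and `compressFor` are hypothetical names, and the zod output schema is omitted so the sketch runs without dependencies:

```typescript
// Hypothetical parent state: everything the orchestrator has accumulated so far.
type ParentState = Record<string, unknown>;

interface CompressedTask {
  goal: string;
  relevantContext: string; // only the allow-listed slices, serialized
  budget: { maxTokens: number; maxToolCalls: number; timeoutMs: number };
}

// Copy only the keys the child needs; the parent's transcript never leaves the parent.
function compressFor(
  goal: string,
  state: ParentState,
  neededKeys: string[],
  budget: CompressedTask['budget'],
): CompressedTask {
  const slice: Record<string, unknown> = {};
  for (const key of neededKeys) {
    if (key in state) slice[key] = state[key];
  }
  return { goal, relevantContext: JSON.stringify(slice), budget };
}

const parentState: ParentState = {
  conversationHistory: ['(long transcript)'],
  competitors: ['Acme', 'Globex'],
  pricingSchema: { sku: 'string', listPrice: 'number', currency: 'string' },
  draftIntro: 'Third-quarter results were...',
};

const task = compressFor(
  'Analyze competitor pricing',
  parentState,
  ['competitors', 'pricingSchema'],
  { maxTokens: 8_000, maxToolCalls: 10, timeoutMs: 60_000 },
);
console.log(task.relevantContext);
```

The allow-list makes the compression decision explicit and reviewable, which is harder to achieve when the parent asks an LLM to "summarize what the child needs".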

Result Roll-Up

Children return results to parents. The parent aggregates, synthesizes, and either acts on the combined result or passes a summary to its own parent.

Roll-up rules:

Children return structured data, not prose. The parent synthesizes.
If a child returns more data than the parent can consume, the child includes a summary.
Status updates flow up asynchronously. Final results flow up synchronously.
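The first two rules translate into a small contract between child and parent. A minimal sketch, assuming hypothetical `ChildResult` and `rollUp` names and using a character count as a stand-in for a real token count:

```typescript
// Structured child output: data for the parent, plus a summary when data is large.
interface ChildResult<T> {
  childId: string;
  data: T;          // structured output; the parent does the synthesis
  summary?: string; // included when data may exceed the parent's budget
}

// Parent-side roll-up: use the raw data if it fits, otherwise fall back to the summary.
function rollUp(results: ChildResult<unknown>[], maxCharsPerChild: number): string[] {
  return results.map((r) => {
    const raw = JSON.stringify(r.data);
    if (raw.length <= maxCharsPerChild) return `${r.childId}: ${raw}`;
    // Oversized payload: the parent consumes the child's summary instead of the raw data.
    return `${r.childId}: ${r.summary ?? '[oversized, no summary provided]'}`;
  });
}

const rolled = rollUp(
  [
    { childId: 'margins', data: { grossMarginPct: 41.5 } },
    {
      childId: 'revenue',
      data: { rows: new Array(500).fill({ region: 'NA', amount: 100 }) },
      summary: '500 revenue rows across regions (summary written by the child)',
    },
  ],
  200,
);
console.log(rolled.join('\n'));
```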

Failure Propagation

Hierarchies need clear failure rules. When a leaf agent fails, what happens? Three strategies, in order of sophistication.

Strategy 1: Fail Up

The child reports failure to its parent with an error description. The parent decides what to do — retry the child, reassign to a different child, degrade gracefully, or escalate to its own parent.

This is the simplest pattern. The downside: cascading delays. If a leaf fails, the escalation has to travel up the tree, and the recovery decision has to travel back down.
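The parent's options on receiving a fail-up report can be made explicit as a decision function. A minimal sketch; the `FailureReport` fields, the retry limit, and the order in which the options are tried are assumptions, not a prescribed policy:

```typescript
interface FailureReport {
  childId: string;
  error: string;
  retryable: boolean; // e.g. a timeout vs. a permissions error
  attempts: number;   // how many times this child has already tried
}

type Recovery =
  | { action: 'retry'; childId: string }
  | { action: 'reassign'; childId: string }
  | { action: 'degrade'; note: string }
  | { action: 'escalate'; reason: string };

function decide(
  report: FailureReport,
  spareChildren: string[], // other children able to take the task
  taskIsOptional: boolean, // can the parent finish without this result?
  maxRetries = 2,
): Recovery {
  if (report.retryable && report.attempts < maxRetries) {
    return { action: 'retry', childId: report.childId };
  }
  const spare = spareChildren.find((id) => id !== report.childId);
  if (spare) return { action: 'reassign', childId: spare };
  if (taskIsOptional) {
    return { action: 'degrade', note: `omitted ${report.childId}: ${report.error}` };
  }
  // Nothing works locally: the failure travels one level up the tree.
  return { action: 'escalate', reason: `${report.childId} failed: ${report.error}` };
}
```

Each `escalate` result is one hop of the cascading delay described above, so the cheaper local options are tried first.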

Strategy 2: Sibling Substitution

The parent keeps a pool of children with overlapping capabilities. If child A fails, the parent routes to child B. This requires children that can pick up mid-task — they need access to whatever state child A accumulated before failing.

OpenHands uses this pattern in its hierarchical delegation primitives. When a specialized worker agent fails at a subtask, the orchestrator can reassign to a different worker with appropriate context.
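Picking up mid-task is easiest when each worker writes progress to a checkpoint the parent owns, so a substitute starts from that state instead of from scratch. A minimal sketch, assuming workers are async functions that read and update a hypothetical `Checkpoint`; this illustrates the pattern and is not OpenHands' API:

```typescript
// Hypothetical shared state a worker updates as it makes progress.
interface Checkpoint {
  completedSteps: string[];
  partialOutput: string[];
}

type Worker = (task: string, checkpoint: Checkpoint) => Promise<string>;

// Try each sibling in turn; every substitute inherits the same checkpoint object.
async function runWithSubstitution(
  task: string,
  workers: Worker[],
): Promise<{ result: string; workerIndex: number }> {
  const checkpoint: Checkpoint = { completedSteps: [], partialOutput: [] };
  const errors: string[] = [];
  for (let i = 0; i < workers.length; i++) {
    try {
      return { result: await workers[i](task, checkpoint), workerIndex: i };
    } catch (err) {
      errors.push(`worker ${i}: ${String(err)}`); // move on to the next sibling
    }
  }
  throw new Error(`all siblings failed: ${errors.join('; ')}`);
}

// Worker A finishes one step, then crashes; worker B resumes from the checkpoint.
const workerA: Worker = async (_task, cp) => {
  cp.completedSteps.push('fetch');
  cp.partialOutput.push('rows=120');
  throw new Error('tool timeout');
};
const workerB: Worker = async (_task, cp) => {
  if (!cp.completedSteps.includes('fetch')) cp.partialOutput.push('rows=120');
  cp.completedSteps.push('analyze');
  return `${cp.partialOutput.join(',')};analysis done`;
};

runWithSubstitution('analyze Q3 costs', [workerA, workerB]).then(console.log);
```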

Strategy 3: Circuit Breakers with Escalation

Track failure rates per child. If a child fails more than N times in a window, stop sending it work and escalate to the parent's parent. This prevents a broken branch from consuming the entire system's budget while it fails repeatedly.

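```typescript
interface CircuitBreaker {
  agentId: string;
  failureCount: number;
  windowStart: number;    // start of the current failure-counting window (epoch ms)
  threshold: number;      // e.g. 3 failures
  windowMs: number;       // e.g. 60_000; also reused as the cooldown once open
  state: 'closed' | 'open' | 'half-open';
  openedAt: number;       // when the breaker last tripped open (epoch ms)
  probeInFlight: boolean; // true while the single half-open test request runs
}

function shouldDelegate(breaker: CircuitBreaker, now: number = Date.now()): boolean {
  if (breaker.state === 'open') {
    // Measure the cooldown from when the breaker opened, not from the window start
    if (now - breaker.openedAt < breaker.windowMs) return false;
    breaker.state = 'half-open';
    breaker.probeInFlight = false;
  }
  if (breaker.state === 'half-open') {
    // Allow exactly one test request; hold everything else until it resolves
    if (breaker.probeInFlight) return false;
    breaker.probeInFlight = true;
    return true;
  }
  return true; // closed: delegate normally
}
```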
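The check above only decides whether to delegate; something also has to count failures inside the window and trip the breaker. A minimal sketch of that bookkeeping, assuming a hypothetical `recordResult` called after every child attempt and an `escalate` callback supplied by the parent. The breaker record is restated in the shape this sketch assumes so the snippet runs on its own:

```typescript
// Breaker record in the shape this sketch assumes (restated so it runs standalone).
interface Breaker {
  agentId: string;
  failureCount: number;
  windowStart: number;
  threshold: number;
  windowMs: number;
  state: 'closed' | 'open' | 'half-open';
  openedAt: number;
}

function recordResult(
  b: Breaker,
  ok: boolean,
  escalate: (agentId: string) => void, // hypothetical hook into the parent's parent
  now: number = Date.now(),
): void {
  if (b.state === 'half-open') {
    // The single probe decides: success closes the breaker, failure re-opens it.
    if (ok) {
      b.state = 'closed';
      b.failureCount = 0;
      b.windowStart = now;
    } else {
      b.state = 'open';
      b.openedAt = now;
    }
    return;
  }
  if (ok) return;
  // Start a fresh counting window if the previous one has expired.
  if (now - b.windowStart > b.windowMs) {
    b.windowStart = now;
    b.failureCount = 0;
  }
  b.failureCount++;
  if (b.failureCount >= b.threshold && b.state === 'closed') {
    b.state = 'open';
    b.openedAt = now;
    escalate(b.agentId); // stop feeding this branch and tell the level above
  }
}
```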

Depth Limits: The Spawn Problem

Recursive hierarchies can spawn infinitely. An orchestrator creates sub-orchestrators, which create their own sub-orchestrators, until you've burned through your token budget on management overhead alone.

Codex solves this with explicit depth limits and resource reservation. Their recursive sub-agent system uses three roles — Default, Explorer, and Worker — with a SpawnReservation mechanism that implements RAII-style resource management. Before a parent agent can spawn a child, it must acquire a reservation from a finite pool. When the child completes (or fails), the reservation is released. If the pool is exhausted, no more children can be spawned, and the parent must handle the work itself or fail.

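```typescript
// Illustrative TypeScript sketch of the reservation idea, not Codex's actual implementation.
interface SpawnToken {
  depth: number;       // depth the spawned child will run at
  release: () => void; // return the slot to the pool (safe to call twice)
}

class SpawnReservation {
  private readonly maxDepth: number;     // number of levels, counting the root as depth 0
  private readonly currentDepth: number;
  private readonly maxConcurrent: number;
  private activeChildren = 0;

  constructor(opts: { maxDepth: number; maxConcurrent: number; currentDepth?: number }) {
    this.maxDepth = opts.maxDepth;
    this.maxConcurrent = opts.maxConcurrent;
    this.currentDepth = opts.currentDepth ?? 0;
  }

  canSpawn(): boolean {
    return (
      // With maxDepth = 3, agents exist only at depths 0, 1, and 2
      this.currentDepth + 1 < this.maxDepth &&
      this.activeChildren < this.maxConcurrent
    );
  }

  acquire(): SpawnToken | null {
    if (!this.canSpawn()) return null;
    this.activeChildren++;
    let released = false;
    return {
      depth: this.currentDepth + 1,
      release: () => {
        if (released) return; // guard against double release
        released = true;
        this.activeChildren--;
      },
    };
  }
}
```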

This is non-negotiable for production systems. Without depth limits, a single confusing task can cascade into hundreds of agent spawns. Set `maxDepth = 3` and `maxConcurrent = 5` as starting defaults. Adjust based on your actual workload and budget.
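A quick way to sanity-check these defaults is to compute the worst case they permit. If every non-leaf agent holds its full concurrency allowance at once, the number of simultaneously active agents is a geometric series over the allowed levels. A minimal sketch, assuming `levels` counts the root as one level:

```typescript
// Worst-case simultaneously active agents when every non-leaf agent
// holds `fanOut` reservations at once: 1 + k + k^2 + ... + k^(levels - 1).
function maxActiveAgents(levels: number, fanOut: number): number {
  let total = 0;
  let atLevel = 1; // the root
  for (let i = 0; i < levels; i++) {
    total += atLevel;
    atLevel *= fanOut;
  }
  return total;
}

console.log(maxActiveAgents(3, 5)); // 31  (1 + 5 + 25)
console.log(maxActiveAgents(4, 5)); // 156 (1 + 5 + 25 + 125)
```

One extra level multiplies the ceiling by roughly the fan-out, which is why the depth limit deserves as much attention as the concurrency limit. Note that a concurrency cap bounds agents alive at the same time, not total spawns over a run, since released reservations can be acquired again.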

Case Study: Enterprise Report Generation

A real-world example: generating a quarterly business report with data from five departments.

Depth 0 — Report Orchestrator: Receives the goal "Generate Q3 business report." Decomposes into five department analyses plus a synthesis task.

Depth 1 — Department Agents (5x): Each receives "Analyze Q3 performance for [department]." Each has access to department-specific tools (APIs, databases, document stores). Each returns a structured analysis.

Depth 2 — Data Workers (variable): Department agents spawn workers for specific data tasks. The Finance department agent might spawn workers for revenue analysis, cost analysis, and margin calculation. These are the leaves — they call tools, process data, and return results.

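```typescript
// createDepartmentAgent() and synthesize() are hypothetical helpers defined elsewhere.
const reportOrchestrator = {
  role: 'orchestrator',
  maxDepth: 3,
  children: ['finance', 'engineering', 'marketing', 'sales', 'ops'],
  async execute(goal: string) {
    const spawnPool = new SpawnReservation({ maxDepth: this.maxDepth, maxConcurrent: 5 });
    const departmentResults = await Promise.all(
      this.children.map(async (dept) => {
        const token = spawnPool.acquire();
        if (!token) return { dept, status: 'skipped', reason: 'no capacity' };
        try {
          // The department agent builds its own pool starting at token.depth,
          // so the depth limit carries down the tree
          const agent = createDepartmentAgent(dept, token.depth);
          const result = await agent.analyze(`Q3 performance for ${dept}`);
          return { dept, status: 'ok', result };
        } catch (err) {
          // Contain the failure to this branch; an uncaught throw would reject Promise.all
          return { dept, status: 'failed', reason: String(err) };
        } finally {
          token.release();
        }
      })
    );
    return synthesize(departmentResults);
  }
};
```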

The hierarchy is three levels deep (depths 0 through 2). The total agent count ranges from 6 to 20 depending on how many workers each department agent spawns. Each agent has bounded context: a department agent doesn't see other departments' data. Failures in one branch don't block others.

Why this beats a flat approach: A single supervisor managing 15 data workers would need to understand finance APIs, engineering metrics, marketing analytics, sales CRMs, and operations dashboards simultaneously. The hierarchical version lets each department agent specialize — the finance agent knows how to interpret GAAP metrics, the engineering agent understands velocity and incident rates. Specialization at each level means better routing, better context, and better results.

The orchestrator's synthesis step is also cleaner. Instead of aggregating 15 raw data outputs, it receives 5 structured department summaries. The cognitive load on the top-level agent drops from "make sense of raw data from five domains" to "synthesize five pre-analyzed summaries into a coherent narrative."
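Because each branch returns a tagged result, the top-level synthesis can also report which departments are missing instead of silently dropping them. A minimal sketch of what the hypothetical `synthesize` step might do, assuming each department agent returns a `summary` string; a production version would likely hand these summaries to an LLM to write the narrative:

```typescript
// Tagged outcome per department branch (shape assumed for this sketch).
type DeptOutcome =
  | { dept: string; status: 'ok'; result: { summary: string } }
  | { dept: string; status: 'skipped' | 'failed'; reason: string };

// Combine department summaries and make gaps explicit for the reader.
function synthesize(outcomes: DeptOutcome[]): string {
  const sections: string[] = [];
  const gaps: string[] = [];
  for (const o of outcomes) {
    if (o.status === 'ok') sections.push(`[${o.dept}]\n${o.result.summary}`);
    else gaps.push(`${o.dept} (${o.status}: ${o.reason})`);
  }
  if (gaps.length > 0) sections.push(`Missing inputs: ${gaps.join(', ')}`);
  return sections.join('\n\n');
}

const report = synthesize([
  { dept: 'finance', status: 'ok', result: { summary: 'Finance summary text' } },
  { dept: 'ops', status: 'failed', reason: 'timeout' },
]);
console.log(report);
```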

When Not to Use Hierarchy

Hierarchy adds overhead. Every layer adds latency (the parent has to process, delegate, and aggregate) and cost (the parent's LLM calls are pure coordination overhead). If your task naturally decomposes into 3-5 independent subtasks, a flat [supervisor-worker pattern](/guides/supervisor-worker-ai-orchestration) is simpler and cheaper.

Use hierarchy when:

Subtasks themselves are complex enough to require further decomposition
You need more than ~8 workers (as a rough heuristic, a single supervisor's routing quality tends to degrade beyond this)
Different branches require different tools, permissions, or model tiers
Failures in one branch should be isolated from other branches
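The checklist above can be encoded as a rough design-time gate. A minimal sketch; the field names are hypothetical, and the threshold of 8 workers is the heuristic from the list, not a measured constant:

```typescript
interface WorkloadProfile {
  subtasksNeedDecomposition: boolean;
  workerCount: number;
  branchesNeedDifferentToolsOrModels: boolean;
  needFailureIsolation: boolean;
}

// Returns the reasons hierarchy is justified; an empty list means stay flat.
function reasonsForHierarchy(p: WorkloadProfile, maxFlatWorkers = 8): string[] {
  const reasons: string[] = [];
  if (p.subtasksNeedDecomposition) reasons.push('subtasks need further decomposition');
  if (p.workerCount > maxFlatWorkers) reasons.push(`more than ${maxFlatWorkers} workers`);
  if (p.branchesNeedDifferentToolsOrModels) reasons.push('branches need different tools, permissions, or model tiers');
  if (p.needFailureIsolation) reasons.push('branch failures must be isolated');
  return reasons;
}

const flat = reasonsForHierarchy({
  subtasksNeedDecomposition: false,
  workerCount: 4,
  branchesNeedDifferentToolsOrModels: false,
  needFailureIsolation: false,
});
console.log(flat.length === 0 ? 'use supervisor-worker' : `use hierarchy: ${flat.join('; ')}`); // use supervisor-worker
```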

Pair hierarchies with [reflection and critique loops](/guides/reflection-critique-loops-ai-agents) at each level — the parent agent reviews its children's outputs before rolling them up. This catches errors before they propagate upward and compound.
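At each level, the review acts as a gate between a child's output and the roll-up: critique it, send it back with feedback a bounded number of times, then report it upward as unaccepted if it still fails. A minimal sketch, assuming hypothetical async `runChild` and `critique` functions (in practice, LLM calls) and a default limit of 2 revisions:

```typescript
interface Critique {
  accept: boolean;
  feedback: string;
}

// Gate a child's output behind a parent-side critique before it is rolled up.
async function reviewedResult(
  runChild: (feedback?: string) => Promise<string>,
  critique: (output: string) => Promise<Critique>,
  maxRevisions = 2,
): Promise<{ output: string; accepted: boolean; revisions: number }> {
  let output = await runChild();
  for (let revisions = 0; ; revisions++) {
    const verdict = await critique(output);
    if (verdict.accept) return { output, accepted: true, revisions };
    if (revisions === maxRevisions) {
      // Out of revisions: flag the result as unaccepted instead of rolling it up silently.
      return { output, accepted: false, revisions };
    }
    output = await runChild(verdict.feedback);
  }
}
```

Bounding the revisions keeps a stubborn child from consuming the branch's budget, the same concern the circuit breaker addresses for outright failures.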

Sources