Human-in-the-Loop: When and How to Insert Human Judgment into AI Agent Pipelines
Not every agent decision should be autonomous. The best systems know exactly where to pause and ask.
The Autonomy Trap
The METR study found that experienced developers were 19% slower when using AI tools, spent 91% more time on review, and introduced 9% more bugs. The finding isn't that AI tools are bad — it's that unchecked automation creates hidden costs. Agents that run without human oversight are faster in the best case and catastrophic in the worst.
Human-in-the-loop isn't a concession to the limitations of AI. It's an engineering pattern that routes decisions to the right processor — human brains for judgment under uncertainty, LLMs for speed on well-defined tasks. The best production agents aren't fully autonomous. They're selectively autonomous.
When to Insert a Human
Four categories of decisions that need human judgment.
High Stakes
Any action that's expensive or irreversible. Sending money. Deleting data. Publishing content to production. Modifying infrastructure. If the cost of a wrong decision exceeds the cost of a 5-minute delay, insert a human checkpoint.
Novel Situations
The agent encounters something outside its training distribution. A customer request it's never seen. A data pattern that doesn't match any template. An error it wasn't designed to handle. These are exactly the situations where LLMs hallucinate most confidently.
Regulatory Requirements
Healthcare, finance, legal — industries where regulations mandate human oversight. An AI agent can draft a medical summary but a clinician must approve it. An agent can flag suspicious transactions but a compliance officer must decide on escalation.
Trust Building
When you're deploying a new agent, run it with human approval on 100% of actions. As confidence builds, loosen to 50%, then 10%, then only edge cases. This graduated trust model is how you get organizational buy-in without asking stakeholders to trust an unproven system.
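One way to operationalize this graduated trust model is a staged review-rate table. A minimal sketch, with illustrative stage boundaries rather than recommendations:

```typescript
// A sketch of graduated trust: the human-review rate starts at 100% and
// drops as the agent accumulates approved actions. The stage boundaries
// below are illustrative, not recommendations.
interface TrustStage {
  minApprovedActions: number; // approvals needed to reach this stage
  reviewRate: number;         // fraction of actions routed to a human
}

const stages: TrustStage[] = [
  { minApprovedActions: 0, reviewRate: 1.0 },     // new agent: review everything
  { minApprovedActions: 500, reviewRate: 0.5 },
  { minApprovedActions: 2000, reviewRate: 0.1 },
  { minApprovedActions: 10000, reviewRate: 0.0 }, // edge cases caught by other triggers
];

function currentReviewRate(approvedActions: number): number {
  // Take the rate of the highest stage the agent has earned
  let rate = 1.0;
  for (const stage of stages) {
    if (approvedActions >= stage.minApprovedActions) rate = stage.reviewRate;
  }
  return rate;
}
```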
The business case for HITL is concrete. An agent that runs autonomously and makes a costly mistake erodes trust in the entire AI initiative — not just the agent. An agent that pauses at the right moments and presents clear reasoning builds trust incrementally. Organizations that skip the trust-building phase typically discover, painfully, that the political cost of one high-profile failure exceeds the efficiency gains of months of successful automation.
Trigger Design
The hard part isn't building the pause mechanism. It's deciding when to trigger it. Three trigger patterns, in order of sophistication.
Confidence Thresholds
The agent scores its own confidence on each action. Below the threshold, it pauses for human review. Claude Code's review system uses a 0-100 confidence scale: 0 is "false positive, ignore this," 25 is "unverified claim," 50 is "minor issue," 75 is "likely real," and 100 is "definitely real." Actions below threshold go to a human; actions above proceed automatically.
interface AgentAction {
type: string;
confidence: number;
reasoning: string;
payload: unknown;
}
interface HITLConfig {
  confidenceThreshold: number; // 0-100, matching the agent's confidence scale
  allowlist: string[];         // action types that never need approval
  blocklist: string[];         // action types that always need approval
  // (the full example later extends this with rules, cost limits, and timeouts)
}
function requiresApproval(action: AgentAction, config: HITLConfig): boolean {
// Always require approval for blocklisted actions
if (config.blocklist.includes(action.type)) return true;
// Never require approval for allowlisted actions
if (config.allowlist.includes(action.type)) return false;
// Confidence threshold for everything else
return action.confidence < config.confidenceThreshold;
}
Action Allow/Block Lists
Some actions are always safe (read-only queries, status checks). Some are never safe without approval (transfers over $1,000, production deployments). The list-based approach is deterministic — no confidence estimation needed.
Codex's Guardian system implements this with granular approval strategies: Never (fully autonomous), OnFailure (human reviews only when something breaks), OnRequest (agent can request human input), and fine-grained per-action policies. This is the most mature HITL trigger design in production today.
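A minimal sketch of per-action strategies in that spirit — the names and shapes below are illustrative, not Guardian's actual API:

```typescript
// Illustrative per-action approval strategies, loosely modeled on the
// Never / OnFailure / OnRequest policies described above.
type ApprovalStrategy = 'never' | 'on_failure' | 'on_request' | 'always';

interface ActionPolicy {
  actionType: string;
  strategy: ApprovalStrategy;
}

function needsHuman(
  policy: ActionPolicy,
  failed: boolean,         // did the last attempt at this action fail?
  agentRequested: boolean  // did the agent explicitly ask for input?
): boolean {
  switch (policy.strategy) {
    case 'never':
      return false;
    case 'on_failure':
      return failed;
    case 'on_request':
      return agentRequested;
    case 'always':
      return true;
  }
}
```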
Cost Guardrails
If a single agent run has spent more than X dollars, pause and ask a human whether to continue. This catches runaway loops, unexpected API costs, and tasks that are more complex than the agent estimated.
function checkCostGuardrail(state: AgentState): boolean {
const costUsd = estimateCost(state.tokensUsed, state.model);
return costUsd > state.costLimit;
}
Handoff UX: Pausing Cleanly
When the agent pauses for human review, it needs to present three things:
1. What It Wants to Do
The proposed action, in plain language. Not "executing tool `transfer_funds` with args `{amount: 5000, to: 'acct_123'}`" but "Transfer $5,000 to vendor account ending in 4567."
2. Why It Wants to Do It
The reasoning chain that led to this action. "The invoice from Acme Corp is due today. The amount matches the PO. Previous invoices from this vendor were approved."
3. What Happens If the Human Says No
The fallback. "If rejected, I'll flag this invoice for manual review and move to the next item in the queue."
Devin's interactive planning system does this well — it presents its intended plan to the user, explains its reasoning, and waits for approval or modification before executing. The user can adjust the plan, not just approve or reject it.
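The three-part handoff above can be captured as a small data shape. The field and function names here are hypothetical, not any specific product's API:

```typescript
// An illustrative shape for the proposal / reasoning / fallback handoff.
interface ReviewCard {
  proposal: string;  // what the agent wants to do, in plain language
  reasoning: string; // why it wants to do it
  fallback: string;  // what happens if the human says no
}

function renderReviewCard(card: ReviewCard): string {
  return [
    `Proposed action: ${card.proposal}`,
    `Reasoning: ${card.reasoning}`,
    `If rejected: ${card.fallback}`,
  ].join('\n');
}
```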
Timeout Handling
Humans don't always respond. Design for it.
interface HITLRequest {
id: string;
action: AgentAction;
context: string;
requestedAt: Date;
timeoutMs: number;
onTimeout: 'deny' | 'approve' | 'escalate';
}
Deny on timeout is the safe default — if nobody approves, nothing happens. Approve on timeout is for low-risk actions where the delay cost exceeds the risk. Escalate on timeout sends to a different reviewer or a backup channel. Codex's Guardian uses a fail-closed approach: timeouts are treated as denials. This is the right default for any system handling sensitive operations.
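The wait itself can be implemented by racing the reviewer's decision against a timer. A minimal sketch, assuming the pending decision arrives as a Promise:

```typescript
type TimeoutPolicy = 'deny' | 'approve' | 'escalate';

interface Decision {
  approved: boolean;
  reason: string;
}

// Race the reviewer's decision against a timer; the onTimeout policy
// decides what the timer resolves to if it wins.
function withTimeout(
  decision: Promise<Decision>,
  timeoutMs: number,
  onTimeout: TimeoutPolicy
): Promise<Decision> {
  const timer = new Promise<Decision>((resolve, reject) => {
    setTimeout(() => {
      if (onTimeout === 'deny') {
        resolve({ approved: false, reason: 'timed out (fail-closed)' });
      } else if (onTimeout === 'approve') {
        resolve({ approved: true, reason: 'auto-approved on timeout' });
      } else {
        // 'escalate': hand off to a backup reviewer or channel
        reject(new Error('timeout: escalate to backup reviewer'));
      }
    }, timeoutMs);
  });
  return Promise.race([decision, timer]);
}
```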
Resume Patterns
The human has decided. Now the agent needs to continue from where it paused. This is surprisingly tricky.
Pattern 1: Inject and Continue
The human's decision is appended to the agent's context as a new message. The agent reads the decision and continues its loop.
async function resumeAfterHITL(
state: AgentState,
decision: HITLDecision
): Promise<void> {
const resumeMessage = decision.approved
? `Human approved: "${decision.action.type}" — proceed.`
: `Human rejected: "${decision.action.type}" — reason: "${decision.reason}". Find an alternative approach.`;
state.turns.push({ role: 'system', content: resumeMessage });
// Continue the agent loop
await agentLoop(state);
}
Pattern 2: Checkpoint and Restore
For long-running agents, serialize the full state before the HITL pause. On resume, deserialize and continue. This handles cases where the agent process might be shut down during the wait. LangGraph's checkpointing makes this native — the graph state is persisted at every node, and a human approval node simply blocks until input arrives.
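A generic sketch of the round trip, assuming the agent state is JSON-serializable. The string here stands in for durable storage (a database row or object-store blob in practice):

```typescript
// Checkpoint-and-restore for a long-running agent pausing at a HITL gate.
interface CheckpointedState {
  turns: { role: string; content: string }[];
  pendingAction: { type: string } | null; // the action awaiting approval
}

function checkpoint(state: CheckpointedState): string {
  // In practice: write to durable storage keyed by the HITL request id
  return JSON.stringify(state);
}

function restore(serialized: string): CheckpointedState {
  return JSON.parse(serialized) as CheckpointedState;
}
```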
Pattern 3: Event-Driven Resume
The HITL pause emits an event. The human response emits another event. The agent subscribes to the response event and resumes when it arrives. This decouples the agent from the reviewer entirely — different processes, potentially different machines.
AutoGen's human-in-the-loop tutorial uses this pattern. The agent conversation pauses, the human is prompted in their terminal or UI, and the response flows back into the conversation as a new message.
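The decoupling can be sketched with a tiny in-process event bus. It stands in for a real message broker, and the event names are illustrative:

```typescript
// Event-driven resume: the agent and the reviewer interact only through
// named events, so in production they can run in different processes
// behind a message broker.
type DecisionHandler = (approved: boolean) => void;

class EventBus {
  private handlers = new Map<string, DecisionHandler[]>();

  once(event: string, handler: DecisionHandler): void {
    const wrapped: DecisionHandler = (approved) => {
      this.handlers.delete(event); // fire at most once
      handler(approved);
    };
    const list = this.handlers.get(event) ?? [];
    list.push(wrapped);
    this.handlers.set(event, list);
  }

  emit(event: string, approved: boolean): void {
    const list = this.handlers.get(event) ?? [];
    list.forEach((h) => h(approved));
  }
}

const bus = new EventBus();

// Agent side: subscribe to the response event, then pause.
function requestReview(requestId: string, onDecision: DecisionHandler): void {
  bus.once(`decision:${requestId}`, onDecision);
}

// Reviewer side: emitting the response event resumes the agent.
function submitDecision(requestId: string, approved: boolean): void {
  bus.emit(`decision:${requestId}`, approved);
}
```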
Example: Financial Transaction Agent
A concrete implementation tying everything together.
const hitlConfig: HITLConfig = {
confidenceThreshold: 85,
allowlist: ['check_balance', 'list_transactions', 'get_exchange_rate'],
blocklist: [],
costLimit: 0.50,
rules: [
{ condition: (a) => a.type === 'transfer' && a.payload.amount > 1000,
action: 'require_approval' },
{ condition: (a) => a.type === 'transfer' && a.payload.amount > 10000,
action: 'require_approval', escalateTo: 'manager' },
{ condition: (a) => a.type === 'close_account',
action: 'require_approval', escalateTo: 'compliance' },
],
timeoutMs: 300_000, // 5 minutes
onTimeout: 'deny',
};
async function processTransaction(action: AgentAction): Promise<TransactionResult> {
if (requiresApproval(action, hitlConfig)) {
const request = createHITLRequest(action, hitlConfig);
// Notify reviewer
await notifyReviewer(request);
// Wait for decision (with timeout)
const decision = await waitForDecision(request);
if (!decision.approved) {
return { status: 'rejected', reason: decision.reason };
}
// Log the approval for audit
await auditLog.record({
action: action.type,
approvedBy: decision.reviewer,
approvedAt: new Date(),
amount: action.payload.amount,
});
}
return executeTransaction(action);
}
The system applies three layers of HITL triggers: confidence threshold (agent-estimated), rule-based (amount thresholds with role-based escalation), and cost guardrail. Transfers over $1,000 always require approval. Transfers over $10,000 escalate to a manager. Account closures go to compliance. Everything else flows through the confidence threshold.
Measuring HITL Effectiveness
Track four metrics to know whether your HITL system is working.
Approval rate by action type. If an action type is approved 99% of the time, the human checkpoint is adding delay without adding value. Move it to the allowlist. If an action type is rejected 40% of the time, the agent needs better reasoning for that task, not more human review.
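Computing the approval-rate metric is a simple grouping. A sketch, assuming review outcomes are logged per action type:

```typescript
// Group review outcomes by action type and compute each type's
// approval fraction.
interface ReviewOutcome {
  actionType: string;
  approved: boolean;
}

function approvalRateByType(outcomes: ReviewOutcome[]): Map<string, number> {
  const counts = new Map<string, { approved: number; total: number }>();
  for (const o of outcomes) {
    const c = counts.get(o.actionType) ?? { approved: 0, total: 0 };
    c.total += 1;
    if (o.approved) c.approved += 1;
    counts.set(o.actionType, c);
  }
  const rates = new Map<string, number>();
  counts.forEach((c, type) => rates.set(type, c.approved / c.total));
  return rates;
}
```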
Reviewer response time. If the average review takes 8 minutes and your SLA is 2 minutes, the bottleneck is human capacity, not agent quality. Consider adding more reviewers, improving the context presentation, or relaxing the trigger threshold.
Override quality. When humans override the agent, are the overrides correct? Sample and audit. If humans override correctly 95% of the time, the HITL system is working. If humans rubber-stamp 80% of reviews without reading context, you have reviewer fatigue — reduce review volume or improve the review UX.
Post-approval failure rate. Of actions that humans approved, how many resulted in bad outcomes? This is the ground truth metric. If approved actions fail at a lower rate than autonomous actions, the HITL system is adding real value.
The Graduation Path
HITL is not the end state. It's the path to earned autonomy.
Track approval rates over time. If 98% of a particular action type is approved, consider moving it to the allowlist. If a certain class of rejections keeps recurring, improve the agent's handling of those cases.
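The promotion check can be as simple as a rate-over-samples rule. The 98% bar and 200-sample minimum below are illustrative, not recommendations:

```typescript
// Data-driven graduation: an action type earns a spot on the allowlist
// once its approval rate clears a bar over enough samples.
const MIN_SAMPLES = 200;          // illustrative
const PROMOTION_THRESHOLD = 0.98; // illustrative

function shouldPromoteToAllowlist(approved: number, total: number): boolean {
  return total >= MIN_SAMPLES && approved / total >= PROMOTION_THRESHOLD;
}
```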
The goal is progressive trust: start with human approval on everything, identify which actions the agent handles reliably, and gradually loosen the reins. This is how you move from [event-driven reactive agents](/guides/event-driven-agent-orchestration) to truly autonomous systems — not by trusting the agent blindly, but by proving trust with data.
Pair HITL triggers with [reflection and critique loops](/guides/reflection-critique-loops-ai-agents) as a pre-filter. If the agent's internal critic catches a problem before the action reaches the human, you've saved the reviewer's time and built a faster feedback loop.