
Reflection and Critique: How AI Agents Can Review and Improve Their Own Output

A second-pass critic catches what the first pass misses. Here's how to design critique loops that converge instead of looping forever.

The Single-Pass Problem

An LLM generates its response in one forward pass. It doesn't re-read its own output, check for contradictions, or verify facts against its sources. What comes out is a first draft — and first drafts have errors.

Claude Code Review found that moving from single-pass to multi-pass review with confidence scoring increased the rate of substantive comments from 16% to 54%. That's not a marginal improvement. That's the difference between a tool that mostly produces noise and a tool that mostly produces signal.

Reflection is the pattern that closes this gap. A second agent (or the same agent with a different prompt) reviews the first agent's output, scores it, and either approves it or sends it back for revision.

What Reflection Actually Is

Reflection is not "ask the LLM if its answer is correct." That's self-evaluation, and it's unreliable — the same biases that produced the error will evaluate it as correct.

Real reflection uses structural separation:

  • Different context. The critic doesn't see the original prompt or reasoning. It sees only the output and judges it on its merits.
  • Different framing. The critic is prompted as an adversary, not a helper. Its job is to find problems, not validate the output.
  • Structured output. The critic doesn't produce prose opinions. It produces a structured assessment with specific dimensions, scores, and actionable feedback.

Codex's Guardian system is the most rigorous production implementation of this pattern. The Guardian runs in a separate read-only session, treats all inputs — including the primary agent's output — as untrusted evidence, and produces a structured risk_score from 0 to 100. It's fail-closed: if the Guardian can't reach a verdict (timeout, error), the output is rejected. This is adversarial verification, not polite self-review.
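The fail-closed behavior is the key design choice: absence of a verdict must mean rejection, never approval. A minimal sketch of that gate (the Guardian's actual internals aren't public, so the API names here are illustrative):

```typescript
// Fail-closed verdict gate: any timeout or error rejects the output.
type Verdict = { riskScore: number; approved: boolean };

async function failClosedReview(
  review: () => Promise<number>, // critic call returning risk_score 0-100
  timeoutMs: number,
  maxRisk: number
): Promise<Verdict> {
  try {
    const riskScore = await Promise.race([
      review(),
      new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error("guardian timeout")), timeoutMs)
      ),
    ]);
    return { riskScore, approved: riskScore <= maxRisk };
  } catch {
    // Fail closed: no verdict is treated as maximum risk.
    return { riskScore: 100, approved: false };
  }
}
```

Note the asymmetry: a crashed critic can block good output, but a crashed critic can never wave bad output through.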

Critique Dimensions

A generic "is this good?" prompt produces generic assessments. Effective critique requires specific dimensions that map to your quality requirements.

Accuracy

Does the output contain factual errors? Are cited sources real? Do numbers add up? This is the hardest dimension to evaluate automatically — it often requires tool access (web search, database queries) to verify claims.

Completeness

Does the output address the full scope of the request? Are there gaps? Did the agent skip a required section or ignore part of the input? Completeness checks are straightforward: compare the output's coverage against the input's requirements.
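Because completeness is a coverage comparison, the simplest version doesn't need an LLM at all. A minimal sketch, assuming the brief lists its required sections by name:

```typescript
// Return the required sections that never appear in the output.
function missingSections(output: string, required: string[]): string[] {
  const text = output.toLowerCase();
  return required.filter((section) => !text.includes(section.toLowerCase()));
}
```

A string match like this only catches hard gaps (a skipped section); judging whether a present section actually answers the request still needs the LLM critic.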

Tone and Style

Is the output appropriate for its audience? Does it match the expected format? Is it too casual, too formal, too verbose? Style checks work well with rubric-based prompts: "Rate formality on a 1-5 scale where 1 is casual conversation and 5 is legal document."

Format Compliance

Does the output match the required schema? Are all required fields present? Is the JSON valid? Format checks are best handled by schema validation, not LLM critique — use Zod, JSON Schema, or Guardrails AI for deterministic validation.
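In production you'd reach for Zod or JSON Schema as suggested above; to keep this sketch dependency-free, here's a hand-rolled version of the same deterministic idea (field specs are illustrative):

```typescript
// Deterministic format check: parse, then verify each required field's type.
interface FieldSpec {
  name: string;
  type: "string" | "number" | "boolean";
}

function validateFields(raw: string, spec: FieldSpec[]): string[] {
  let parsed: Record<string, unknown>;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return ["invalid JSON"];
  }
  return spec
    .filter((f) => typeof parsed[f.name] !== f.type)
    .map((f) => `field "${f.name}" missing or not a ${f.type}`);
}
```

The point is that this check returns the same answer every time, costs nothing, and never hallucinates — exactly why it shouldn't be delegated to an LLM.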

Policy Compliance

Does the output violate any rules? Content policies, brand guidelines, regulatory requirements, internal constraints. Policy checks require the critic to have access to the policy document and compare the output against each rule. Taken together, the five dimensions map onto a single structured result:

interface CritiqueResult {
  dimensions: {
    accuracy:    { score: number; issues: string[] };
    completeness: { score: number; issues: string[] };
    tone:        { score: number; issues: string[] };
    format:      { score: number; issues: string[] };
    policy:      { score: number; issues: string[] };
  };
  overallScore: number;
  verdict: 'approve' | 'revise' | 'reject';
  revisionInstructions?: string;
}
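The schema leaves open how `overallScore` and `verdict` are derived from the dimension scores. One minimal sketch (the equal weighting and the 80/30 thresholds are assumptions to tune, not prescriptions):

```typescript
// Derive an overall score and verdict from per-dimension scores (0-100 each).
function deriveVerdict(
  scores: number[],
  approveAt = 80,    // assumed approval threshold
  rejectBelow = 30   // assumed "too far gone" floor
): { overallScore: number; verdict: "approve" | "revise" | "reject" } {
  const overallScore = scores.reduce((a, b) => a + b, 0) / scores.length;
  const verdict =
    overallScore >= approveAt ? "approve"
    : overallScore < rejectBelow ? "reject"
    : "revise";
  return { overallScore, verdict };
}
```

Many pipelines instead weight dimensions unevenly (accuracy counting double, say) or reject on any single dimension below a floor regardless of the average; the right aggregation depends on which failure modes are costly for you.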

The Critique-Revision Cycle

The critic finds problems. The producer revises. The critic reviews again. How many times?

Convergence Detection

Track the overall score across iterations. If the score improves from iteration 1 to 2, another round is worthwhile. If the score plateaus or decreases, stop — further iteration is wasting tokens without improving quality.

const MAX_PASSES = 3;         // hard cap; see "Avoiding Infinite Loops" below
const QUALITY_THRESHOLD = 80; // tune to your rubric

function shouldContinue(history: CritiqueResult[]): boolean {
  if (history.length === 0) return true;
  if (history.length >= MAX_PASSES) return false;

  const current = history[history.length - 1].overallScore;

  // Stop if already above quality threshold — even after a single pass
  if (current >= QUALITY_THRESHOLD) return false;
  if (history.length < 2) return true;

  const previous = history[history.length - 2].overallScore;

  // Stop if score plateaued or regressed
  if (current <= previous) return false;

  return true;
}

Avoiding Infinite Loops

Without a hard cap, a perfectionist critic and an eager producer will loop forever — the critic always finds something to improve, the producer always tries to fix it, and the "improvements" start degrading other dimensions.

Hard cap: 3 passes maximum. Research shows diminishing returns after 2-3 iterations. Set `MAX_PASSES = 3` and treat it as inviolable.

Quality floor: If the score is below 30 after the first pass, reject outright. The output is too far gone for iterative refinement — regenerate from scratch or escalate.

Dimension lock: Once a dimension scores above threshold, don't let subsequent revisions degrade it. If accuracy is at 90 after pass 1, flag any pass 2 revision that drops it below 85.
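The dimension lock is mechanical to check between passes. A sketch, using the article's numbers (lock at 85, allow a 5-point tolerance so accuracy at 90 flags any drop below 85):

```typescript
// Flag locked dimensions that regressed beyond tolerance between passes.
function regressedDimensions(
  prev: Record<string, number>, // scores after the earlier pass
  next: Record<string, number>, // scores after the latest revision
  lockAbove = 85,
  tolerance = 5
): string[] {
  return Object.keys(prev).filter(
    (dim) => prev[dim] >= lockAbove && next[dim] < prev[dim] - tolerance
  );
}
```

Any dimension this returns should block the revision from being accepted as-is: either revert to the previous draft or revise again with an explicit instruction not to touch the locked dimension.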

Critic Agent Design

The critic is a separate agent with its own prompt, its own context, and its own constraints. Design decisions that matter:

Separate Context

The critic should not see the producer's reasoning chain or the original user prompt (unless evaluating relevance). Seeing the reasoning introduces confirmation bias — "the agent thought about this carefully, so it must be right." The critic evaluates the output as a standalone artifact.

Claude Code Review's 8-step pipeline enforces this: the review system runs 5 parallel Sonnet agents, each with its own context, and uses a confidence scoring system where 0 means false positive and 100 means definite real issue. The system achieves a false positive rate below 1% — because each agent's assessment is independent.

Adversarial Framing

Prompt the critic as a skeptic, not a reviewer. "Your job is to find every flaw in this output. Assume the author made mistakes. Look for factual errors, logical gaps, missing context, and policy violations. If you find nothing wrong, say so — but look hard first."

Codex's Guardian takes this further: it treats all inputs as untrusted evidence and approaches every review adversarially. The risk_score it produces isn't "how good is this" — it's "how dangerous is it to let this through."

Structured Schema

Force the critic to fill out a schema, not write free-form feedback. Schemas prevent the critic from being vague ("this could be better") and force specific, actionable assessments.

Qodo's multi-agent system uses a Judge agent that evaluates outputs from multiple other agents, resolves conflicts between them, and produces a consolidated verdict. Their approach achieves an F1 score of 60.1% — meaningful signal that improves on any single agent's judgment.

Example: Content Pipeline with Critic

A production content pipeline that generates, critiques, and revises blog posts.

async function contentPipeline(brief: ContentBrief): Promise<ContentResult> {
  // Phase 1: Generate draft
  let draft = await producer.generate(brief);
  const critiqueHistory: CritiqueResult[] = [];

  // Phase 2: Critique-revision loop
  for (let pass = 0; pass < MAX_PASSES; pass++) {
    const critique = await critic.evaluate(draft, {
      dimensions: ['accuracy', 'completeness', 'tone', 'format', 'policy'],
      policyDoc: brief.styleguide,
      audienceLevel: brief.audience,
    });

    critiqueHistory.push(critique);
    logger.info(`Pass ${pass + 1}: score=${critique.overallScore}, verdict=${critique.verdict}`);

    if (critique.verdict === 'approve') break;
    if (critique.verdict === 'reject') {
      draft = await producer.generate(brief); // Full regeneration
      continue;
    }

    // Revise based on critique
    draft = await producer.revise(draft, critique.revisionInstructions, {
      // Lock dimensions that already pass
      preserveDimensions: Object.entries(critique.dimensions)
        .filter(([_, v]) => v.score >= QUALITY_THRESHOLD)
        .map(([k]) => k),
    });

    if (!shouldContinue(critiqueHistory)) break;
  }

  // Phase 3: Final validation (deterministic, not LLM)
  const formatValid = await validateSchema(draft, brief.outputSchema);
  const policyClean = await checkPolicies(draft, brief.policies);

  if (!formatValid || !policyClean) {
    return { status: 'failed', draft, issues: { formatValid, policyClean } };
  }

  return { status: 'approved', content: draft, passes: critiqueHistory.length };
}

The pipeline separates concerns: the producer generates, the critic evaluates, and deterministic validators handle format and policy checks. The LLM handles judgment (accuracy, completeness, tone). Code handles rules (schema, policy).

Reflection vs Human Review

Reflection doesn't replace human review. It filters out the obvious problems so humans review better content. The 16% to 54% substantive comment rate improvement in Claude Code Review means the review system handles the noise, and human reviewers focus on genuine issues.

The cost math makes this compelling. If a human reviewer spends 10 minutes per item reviewing raw LLM output, and 40% of those items have obvious problems the critic could have caught, you're wasting 4 minutes per item on problems a $0.02 LLM call could have flagged. At 100 items per day, that's nearly 7 hours of reviewer time saved — not by removing the human, but by giving the human cleaner input.
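That back-of-envelope math as code, with the article's illustrative figures as inputs:

```typescript
// Reviewer hours saved per day when a critic pre-filters obvious problems.
function hoursSaved(
  itemsPerDay: number,
  wastedMinutesPerItem: number, // time spent per item on catchable problems
  obviousFraction: number       // share of items with critic-catchable issues
): number {
  return (itemsPerDay * wastedMinutesPerItem * obviousFraction) / 60;
}

// 100 items/day, 10 wasted minutes each, 40% catchable:
// 100 * 10 * 0.4 = 400 minutes ≈ 6.7 hours
```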

When Reflection Fails

Reflection is not a universal solution. Three cases where it breaks down.

Shared blind spots. If the producer and critic use the same model, they share the same knowledge gaps. A factual error that the model "believes" will pass both the generation and the critique. Mitigate by using tool-augmented critics that can verify claims against external sources.

Style over substance. Critics trained on writing quality tend to "improve" outputs by making them longer, more hedged, and more generic. Each revision pass adds qualifiers and removes directness. Mitigate with dimension locks — once tone scores above threshold, don't let subsequent revisions change it.

Critique-induced regression. The producer "fixes" issue A but introduces issue B. The critic catches B but the fix reintroduces A. This oscillation burns tokens without converging. The convergence detection code above catches this — if the overall score plateaus, stop.

The pattern pairs naturally with [tool-calling loops](/guides/tool-calling-loops-ai-agents) — the critic can call tools to verify the producer's claims. And it integrates with [human-in-the-loop](/guides/human-in-the-loop-ai-agents) patterns — reflection as a pre-filter reduces the volume of decisions that need human judgment, making HITL systems practical at scale.

The goal isn't autonomous perfection. It's systematic improvement — each pass catches errors the previous pass missed, and the structured schema ensures you can measure and improve the system over time.

Sources

  • docs.anthropic.com/en/docs/build-with-claude/agentic-systems
  • arxiv.org/abs/2303.17651
  • langchain-ai.github.io/langgraph/how-tos/reflection
  • www.qodo.ai/blog/qodo-multi-agent-architecture
Author: Agentifact Editorial
Category: Guide
Published: Mar 25, 2026