Guardrails
Definition
Safety mechanisms applied to agent systems to constrain behavior, validate outputs, detect policy violations, and prevent harmful actions. Guardrails operate at multiple layers: input guardrails screen what enters the agent (prompt injection detection, PII redaction, topic filtering); output guardrails validate what the agent produces (factual grounding checks, toxicity detection, format validation); action guardrails constrain what the agent can do (permission scoping, rate limiting, blast-radius caps, irreversibility checks). Guardrails can be implemented as blocking filters (halt if violated), soft warnings (log and continue), or human escalation triggers (pause and request approval).
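All three layers share a common shape: a check paired with an enforcement mode. The sketch below illustrates that shape in Python; the `Rule`, `Mode`, and `apply_layer` names and the keyword checks are illustrative assumptions, not the API of any particular tool.

```python
import logging
from dataclasses import dataclass
from enum import Enum
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("guardrails")

class Mode(Enum):
    BLOCK = "block"        # blocking filter: halt if violated
    WARN = "warn"          # soft warning: log and continue
    ESCALATE = "escalate"  # human escalation: pause and request approval

@dataclass
class Rule:
    name: str
    check: Callable[[str], bool]   # returns True when the text violates the rule
    mode: Mode

class GuardrailViolation(Exception):
    pass

def request_human_approval(rule_name: str, text: str) -> bool:
    # Stub: a real system would notify a reviewer and wait for a decision.
    return input(f"[{rule_name}] approve this? (y/n) ").strip().lower() == "y"

def apply_layer(text: str, rules: list[Rule]) -> str:
    """Run one guardrail layer (input, output, or action) over `text`."""
    for rule in rules:
        if not rule.check(text):
            continue
        if rule.mode is Mode.BLOCK:
            raise GuardrailViolation(f"{rule.name} blocked the request")
        if rule.mode is Mode.WARN:
            log.warning("%s flagged the text but allowed it through", rule.name)
        if rule.mode is Mode.ESCALATE and not request_human_approval(rule.name, text):
            raise GuardrailViolation(f"{rule.name}: human approval denied")
    return text

# Input layer: crude keyword screens for illustration only;
# production systems use trained classifiers for these checks.
input_rules = [
    Rule("prompt_injection",
         lambda t: "ignore previous instructions" in t.lower(), Mode.BLOCK),
    Rule("off_topic",
         lambda t: "stock tips" in t.lower(), Mode.WARN),
]

user_input = apply_layer("Summarize this ticket for me.", input_rules)
# ...run the agent, then pass its output through an output-layer rule list.
```

The same `apply_layer` function serves the output layer (grounding, toxicity, format checks) and the action layer by swapping in a different rule list, which keeps enforcement-mode handling in one place.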
Builder Context
Every production agent needs at least three guardrails: (1) an input screen to catch prompt injection and out-of-scope requests; (2) an output validator to catch hallucinations and policy violations before they reach users or downstream systems; (3) an action gate to block irreversible actions without explicit approval (sketched below). The most common failure mode is overly broad guardrails: they block legitimate agent actions and erode the system's usefulness. Design guardrails around specific failure modes you've observed, not hypothetical risks. Key tools: Guardrails AI (output validation with validators), NeMo Guardrails (conversation rail definitions), LlamaGuard (classification-based safety), Patronus AI (production evaluation).
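To make the action gate concrete, here is a minimal sketch that combines the permission scope, rate limit, blast-radius cap, and irreversibility check from the definition above. The `ActionGate` class, tool names, and thresholds are hypothetical assumptions, not part of any framework named here.

```python
import time
from dataclasses import dataclass, field

# Assumed policy values for illustration; tune these to observed failure modes.
IRREVERSIBLE_TOOLS = {"delete_record", "send_email", "transfer_funds"}
MAX_CALLS_PER_MINUTE = 10   # rate limit on tool calls
MAX_ROWS_AFFECTED = 100     # blast-radius cap for bulk operations

@dataclass
class ActionGate:
    call_times: list[float] = field(default_factory=list)

    def allow(self, tool: str, rows_affected: int = 0, approved: bool = False) -> None:
        """Raise PermissionError unless the proposed tool call passes every check."""
        now = time.monotonic()
        # Keep only calls from the last 60 seconds, then enforce the rate limit.
        self.call_times = [t for t in self.call_times if now - t < 60]
        if len(self.call_times) >= MAX_CALLS_PER_MINUTE:
            raise PermissionError("rate limit exceeded; pausing the agent")
        if rows_affected > MAX_ROWS_AFFECTED:
            raise PermissionError(f"blast radius of {rows_affected} rows exceeds cap")
        if tool in IRREVERSIBLE_TOOLS and not approved:
            raise PermissionError(f"{tool} is irreversible; explicit approval required")
        self.call_times.append(now)

gate = ActionGate()
gate.allow("search_docs")                   # reversible and in scope: passes
gate.allow("delete_record", approved=True)  # irreversible: passes only with approval
```

Raising an exception keeps the gate a hard stop by default; a softer deployment could catch `PermissionError` and route it to the escalation path shown in the layer sketch above.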