Prompt Injection
Definition
An attack where malicious instructions are embedded in user input, retrieved documents, or tool results — attempting to override the agent's system prompt and redirect its behavior. Prompt injection is the most critical security vulnerability in agent systems because agents act on their instructions: a successful injection can cause the agent to exfiltrate data, call unauthorized tools, or produce harmful outputs. Direct injection embeds instructions in user messages; indirect injection hides instructions in data the agent retrieves (web pages, documents, API responses).
Builder Context
Every agent that processes untrusted input is vulnerable to prompt injection. Defense in depth: (1) input filtering — scan for known injection patterns before they reach the model; (2) privilege separation — run the model with minimal permissions, never give it credentials directly; (3) output validation — check agent actions against an allowlist before execution; (4) tool scoping — each tool should have the minimum permissions needed. The most dangerous pattern: agents that can call tools that modify their own instructions or permissions. Never build that.