Agentifact assessment — independently scored, not sponsored. Last verified Mar 6, 2026.
Rebuff
Open-source prompt injection detector from ProtectAI with a four-layer defense: heuristics to filter suspicious inputs, an LLM-based classifier, a vector database of known attack embeddings, and canary tokens to detect prompt leakage. Integrates via Python SDK. Currently a prototype suitable for research and early-stage agent hardening. Free and self-hosted.
Use with care — notable gaps remain
You're building an LLM agent that accepts user input and need to block prompt injection attacks before they reach your model, without relying on a single detection method.
Fast heuristic checks run first (no API cost), then LLM-based detection kicks in for sophisticated attacks. False positives are possible on legitimate edge-case inputs. Canary token leakage detection requires you to monitor model outputs. The system learns from attacks you log, but you're responsible for feeding it domain-specific attack patterns.
You need to detect when sensitive information (API keys, system prompts, internal data) is being exfiltrated through prompt injection or model output manipulation.
Canary tokens are a honeypot—they don't prevent exfiltration, they detect it after the fact. You still need to decide what to do when leakage is detected (alert, block user, retrain). Works best when you control the prompt template; less effective against indirect injection vectors.
No complete defense against prompt injection
Rebuff's own documentation states there are no known complete solutions to prompt injection. Skilled attackers can discover new vectors or bypass all four layers. The tool raises the bar but doesn't eliminate risk.
Alpha-stage maturity with false positives/negatives
Rebuff is explicitly in alpha. The framework may produce false positives (blocking legitimate inputs) or false negatives (missing real attacks). No production SLA or stability guarantees. Expect breaking changes and API shifts.
LLM-based detection adds latency and cost
The second defense layer calls an LLM (OpenAI by default) to classify each input. This adds ~500ms–2s per request and costs ~$0.001–0.01 per detection depending on input length. For high-volume agents, this becomes a bottleneck. Heuristics run free and fast, but sophisticated attacks require the expensive layer.
Trust Breakdown
What It Actually Does
Detects when users try to manipulate AI agents with malicious prompts by checking incoming text against known attack patterns and suspicious language signatures. You integrate it into your agent's input pipeline to block these attacks before they reach the core system.
Open-source prompt injection detector from ProtectAI with a four-layer defense: heuristics to filter suspicious inputs, an LLM-based classifier, a vector database of known attack embeddings, and canary tokens to detect prompt leakage. Integrates via Python SDK. Currently a prototype suitable for research and early-stage agent hardening.
Free and self-hosted.
Fit Assessment
Best for
- ✓prompt-injection-detection
- ✓llm-security
Score Breakdown
Protocol Support
Capabilities
Governance
- input-validation