Medium severity · LlamaGuard (all versions/sizes)

LlamaGuard incorrectly classifies benign prompts as unsafe (e.g., it flags adult roleplay that mentions ages such as 20/27 as 'S4: Child Exploitation'), leading to unnecessary blocking and over-moderation. The default setup accepts low-confidence unsafe predictions with no threshold control. [PurpleLlama Issue #74](https://github.com/meta-llama/PurpleLlama/issues/74)

Root cause

LlamaGuard generates text beginning with 'safe' or 'unsafe', so the classification is effectively determined by the first-token logits. The default implicit threshold (~0.5 probability) is tuned to minimize false negatives (missed unsafe content), but this makes the model over-sensitive to certain patterns (e.g., age mentions falsely triggering S4, likely due to training-data bias). Low-confidence unsafe predictions (e.g., 0.679 unsafe probability) are not filtered in standard generation mode, causing overblocking with no tunable sensitivity. [PurpleLlama Issue #74](https://github.com/meta-llama/PurpleLlama/issues/74) · [Krnel Blog](https://krnel.ai/blog/2025-10-29-kg-guardrail-example/)
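The thresholding idea can be sketched as follows. This is a minimal illustration, not LlamaGuard's actual inference code: it assumes you have already extracted the first-token logits for the 'safe' and 'unsafe' tokens (e.g., via `output_scores=True` in a Hugging Face `generate` call) and shows how a softmax over those two logits yields an unsafe probability that a caller can gate with a tunable threshold instead of accepting the greedy argmax.

```python
import math


def unsafe_probability(safe_logit: float, unsafe_logit: float) -> float:
    """Softmax over the two first-token logits ('safe' vs. 'unsafe')."""
    # Subtract the max for numerical stability before exponentiating.
    m = max(safe_logit, unsafe_logit)
    e_safe = math.exp(safe_logit - m)
    e_unsafe = math.exp(unsafe_logit - m)
    return e_unsafe / (e_safe + e_unsafe)


def classify(safe_logit: float, unsafe_logit: float,
             threshold: float = 0.9) -> tuple[str, float]:
    """Flag 'unsafe' only when P(unsafe) clears a caller-chosen threshold.

    Greedy decoding implicitly uses ~0.5; raising the threshold filters
    low-confidence unsafe predictions like the 0.679 case from the issue.
    """
    p = unsafe_probability(safe_logit, unsafe_logit)
    return ("unsafe" if p >= threshold else "safe", p)


# A logit gap of 0.75 gives P(unsafe) ~= 0.679: flagged under the
# implicit ~0.5 threshold, but passed with a stricter 0.9 threshold.
label_default, p = classify(0.0, 0.75, threshold=0.5)
label_strict, _ = classify(0.0, 0.75, threshold=0.9)
```

The threshold value (0.9 here) is an arbitrary example; the right setting depends on the deployment's tolerance for false negatives versus overblocking.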

Tags: LlamaGuard · threshold · false-positive · logit-scores · overblocking · safety-classification
