Safety Evaluation
Definition
Systematic testing and measurement of an AI agent's behavior against safety criteria — covering factual accuracy, harmful content avoidance, instruction following, bias detection, and robustness to adversarial inputs. Safety evaluation combines automated metrics (hallucination rate, toxicity scores, compliance with safety policies) with human evaluation (red teaming, user studies). It is distinct from capability evaluation (which measures how well the agent performs tasks) — an agent can be highly capable but unsafe.
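As a toy illustration of the automated side, the sketch below aggregates two of the metrics named above (hallucination rate, toxicity) over labeled evaluation records. The record fields and the assumption that toxicity arrives as a classifier score in [0, 1] are illustrative, not a standard schema.

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    response: str
    is_hallucination: bool  # label from a fact-checking model or human rater
    toxicity: float         # assumed classifier score in [0, 1]

def safety_metrics(records: list[EvalRecord]) -> dict[str, float]:
    assert records, "empty evaluation set"
    n = len(records)
    return {
        # fraction of responses flagged as factually unsupported
        "hallucination_rate": sum(r.is_hallucination for r in records) / n,
        # mean toxicity score across all responses
        "mean_toxicity": sum(r.toxicity for r in records) / n,
    }
```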
Builder Context
Build safety evaluation into your CI/CD pipeline, not as a one-time pre-launch check. Maintain a growing test suite of: (1) known failure cases (prompts that previously caused issues), (2) adversarial inputs (prompt injections, jailbreaks), (3) boundary cases (ambiguous requests, conflicting instructions), and (4) regression tests (verify that previous safety fixes haven't regressed). Track safety metrics over time: a declining safety score on a stable test suite indicates model drift or prompt degradation.
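A minimal sketch of such a CI gate follows. The `safety_suite.jsonl` file format, the `run_agent` and `violates_policy` hooks, and the 0.95 pass-rate threshold are all assumptions for illustration; the placeholders would be wired to your real agent and safety checker.

```python
import json
import sys
from collections import defaultdict

PASS_RATE_THRESHOLD = 0.95  # assumed gate; tune per category in practice

BLOCKLIST = ("ignore previous instructions",)  # toy placeholder check

def run_agent(prompt: str) -> str:
    # Placeholder: wire this to the agent under test.
    return "stub response"

def violates_policy(response: str) -> bool:
    # Placeholder automated check; swap in a real toxicity or
    # policy-compliance classifier.
    return any(term in response.lower() for term in BLOCKLIST)

def main(path: str = "safety_suite.jsonl") -> int:
    # Each line is assumed to look like:
    # {"prompt": ..., "category": "known_failure" | "adversarial"
    #                              | "boundary" | "regression"}
    results = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            response = run_agent(case["prompt"])
            passed = not violates_policy(response)
            results[case["category"]][0] += passed
            results[case["category"]][1] += 1

    failed = False
    for category, (passed, total) in sorted(results.items()):
        rate = passed / total
        print(f"{category}: {passed}/{total} passed ({rate:.1%})")
        if rate < PASS_RATE_THRESHOLD:
            failed = True  # gate the pipeline on any weak category

    return 1 if failed else 0  # nonzero exit fails the CI job

if __name__ == "__main__":
    sys.exit(main())
```

Reporting pass rates per category (rather than one aggregate score) makes drift visible: an overall score can hold steady while, say, the adversarial bucket quietly degrades.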