Evaluation (Evals)
Definition
The practice of systematically measuring an AI agent's performance against defined criteria using test datasets, automated metrics, and/or human judgment. Evaluations typically cover five dimensions: task completion (did the agent accomplish the goal?), quality (how good was the output?), efficiency (how many steps or tokens did it take?), safety (did it violate any policies?), and reliability (does it produce consistent results?). Evaluation is essential for comparing model versions, measuring the impact of prompt changes, catching regressions, and building confidence before deployment.
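The dimensions above can be sketched as a minimal eval harness. This is an illustrative sketch, not a standard API: the agent interface (a callable returning an answer plus a step count), the `EvalCase` dataclass, and the metric names are all assumptions made for the example, covering just the task-completion and efficiency dimensions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class EvalCase:
    prompt: str      # input given to the agent
    expected: str    # reference answer used to judge task completion

def run_eval(agent: Callable[[str], Tuple[str, int]],
             cases: List[EvalCase]) -> Dict[str, float]:
    """Score an agent on completion rate and average step count.

    Assumes the agent returns (final_answer, steps_taken); quality,
    safety, and reliability checks would plug in alongside these.
    """
    completions, steps = [], []
    for case in cases:
        output, n_steps = agent(case.prompt)
        completions.append(output.strip() == case.expected)  # task completion
        steps.append(n_steps)                                # efficiency
    return {
        "completion_rate": sum(completions) / len(cases),
        "avg_steps": sum(steps) / len(cases),
    }
```

An exact-match check is the crudest possible grader; real suites usually swap in fuzzier scoring (LLM-as-judge, rubric scoring, or human review) for the quality dimension while keeping the same harness shape.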
Builder Context
The eval suite is the backbone of agent development. Without evals, you're flying blind on every change. Build evals at three levels: (1) unit evals — test individual capabilities (can the agent select the right tool? can it parse this format?); (2) integration evals — test multi-step tasks (can the agent complete this workflow end-to-end?); (3) adversarial evals — test robustness (does the agent handle edge cases, ambiguous inputs, and attacks?). Run evals on every prompt change, model update, and tool modification. Automate what you can, but include human evaluation for subjective quality.
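A unit eval from level (1) can be sketched as follows. Everything here is hypothetical: the `select_tool` callable stands in for the agent's tool-selection step, and the queries, tool names, and accuracy threshold are invented for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical unit-eval cases: (query, tool the agent should pick).
TOOL_SELECTION_CASES: List[Tuple[str, str]] = [
    ("What is 37 * 91?", "calculator"),
    ("Summarize the latest climate report", "web_search"),
    ("Open notes.txt and list its headings", "file_reader"),
]

def eval_tool_selection(select_tool: Callable[[str], str],
                        cases: List[Tuple[str, str]] = TOOL_SELECTION_CASES,
                        threshold: float = 1.0):
    """Check the agent's tool-selection step against labeled cases.

    Returns (accuracy, passed, failures) so a CI job can both gate on
    the threshold and print which cases regressed.
    """
    failures = [(query, want, got)
                for query, want in cases
                if (got := select_tool(query)) != want]
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, accuracy >= threshold, failures
```

Because it returns the failing cases rather than just a score, a run on every prompt or tool change immediately shows *which* capability regressed, which is the point of keeping unit evals separate from end-to-end integration evals.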