Medium severity: Braintrust eval scorers (LLM judges)

Eval scores for the same inputs and model vary across runs (e.g., 0.85 in one run, 0.78 in the next), making it hard to distinguish genuine regressions or improvements from noise, or to trust aggregate metrics.

Root cause

LLM-as-a-judge scorers used in Braintrust evals are probabilistic models: their outputs vary across runs even for identical inputs due to inherent non-determinism (influenced, though not always fully eliminated, by the temperature parameter). Small score differences therefore often reflect sampling noise rather than true quality changes.
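One common mitigation (a sketch, not a Braintrust-specific API) is to run the judge several times per example and report the mean score with a confidence interval, so run-to-run differences can be compared against the noise band. The `judge_score` function below is a hypothetical stand-in for a real LLM judge call; it simulates judge variance with seeded random noise.

```python
import random
import statistics

def judge_score(example: str, seed: int) -> float:
    """Hypothetical stand-in for a non-deterministic LLM judge call.

    Real judges vary across runs (temperature > 0); here that is
    simulated with deterministic noise around a "true" quality of 0.8.
    """
    rng = random.Random(f"{example}:{seed}")  # str seeds are deterministic
    return max(0.0, min(1.0, 0.8 + rng.gauss(0, 0.04)))

def score_with_ci(example: str, n_runs: int = 10) -> tuple[float, float]:
    """Run the judge n_runs times; return (mean, ~95% CI half-width)."""
    scores = [judge_score(example, seed) for seed in range(n_runs)]
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / (n_runs ** 0.5)
    return mean, half_width

mean, ci = score_with_ci("What is the capital of France?")
print(f"score = {mean:.3f} +/- {ci:.3f}")
```

With this approach, two eval runs are only treated as different when their confidence intervals do not overlap; a 0.85 vs. 0.78 gap may well fall inside the noise band of a single-run judge.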

Braintrust, eval, LLM-as-judge, scoring, reproducibility, temperature, non-deterministic, LLM variability
