Agent Benchmark Tracker
Task-level performance snapshots across autonomous agent systems. Every entry is sourced and dated — this is a historical record, not a leaderboard.
5 tasks tracked · updated as results are published
Methodology
Every entry requires a public source URL. Scores are reported exactly as published by the benchmark authors; we do not normalize or adjust them. The best score is the highest result published on the benchmark at the time of entry, attributed to the model that achieved it. When a benchmark updates, we add new entries rather than overwriting old ones, which preserves the historical record.
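The rules above can be sketched as a small append-only store. This is an illustrative sketch, not the tracker's actual implementation: the names `Entry`, `Tracker`, `history`, and `best_as_of` are hypothetical, as are the task, model, and URL values in the usage lines. It also assumes that a higher score is better, which may not hold for every benchmark.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a recorded entry is never edited in place
class Entry:
    task: str         # benchmark or task name
    model: str        # model credited with the score
    score: float      # as published by the benchmark authors, not normalized
    source_url: str   # public source, required for every entry
    entry_date: date  # date the entry was recorded

    def __post_init__(self):
        # Reject entries without a public source URL.
        if not self.source_url.startswith(("http://", "https://")):
            raise ValueError("every entry requires a public source URL")


class Tracker:
    """Append-only record: entries are added, never overwritten or removed."""

    def __init__(self):
        self._entries = []

    def add(self, entry):
        self._entries.append(entry)

    def history(self, task):
        # All entries for one task, oldest first.
        matching = [e for e in self._entries if e.task == task]
        return sorted(matching, key=lambda e: e.entry_date)

    def best_as_of(self, task, as_of):
        # Highest published score among entries recorded on or before `as_of`
        # (assumes higher is better); None if the task has no entries yet.
        candidates = [e for e in self.history(task) if e.entry_date <= as_of]
        return max(candidates, key=lambda e: e.score) if candidates else None


# Usage with made-up placeholder values.
tracker = Tracker()
tracker.add(Entry("example-task", "model-a", 40.0,
                  "https://example.com/result-a", date(2024, 1, 1)))
tracker.add(Entry("example-task", "model-b", 55.0,
                  "https://example.com/result-b", date(2024, 6, 1)))
best_in_march = tracker.best_as_of("example-task", date(2024, 3, 1))
```

Because nothing is overwritten, the best score as of any past date can still be looked up after newer results arrive, which is what separates a historical record from a leaderboard.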