
Agent Benchmark Tracker

Task-level performance snapshots across autonomous agent systems. Every entry is sourced and dated — this is a historical record, not a leaderboard.

5 tasks tracked · updated as results are published

Each entry lists the task, the benchmark and scoring rule, the best agent/model, its score, and when the result was measured.

Task: Software engineering
Benchmark: SWE-bench Verified subset. Score = % of issues resolved end-to-end. (↗ swebench.com)
Best agent/model: Devin 2.0
Score: 70.3% · measured Dec 2025

Task: Long-horizon planning
Benchmark: τ-bench (tau-bench), retail and airline domains. Measures task completion in long agentic runs. (↗ github.com)
Best agent/model: Claude 3.5 Sonnet + τ-bench setup
Score: 54.2% · measured Nov 2025

Task: Research synthesis
Benchmark: GAIA, Levels 1-3. Score = % of questions answered correctly. (↗ huggingface.co)
Best agent/model: Gemini 2.0 + GAIA agent
Score: 75.9% · measured Oct 2025

Task: Code generation
Benchmark: HumanEval, pass@1 score. (↗ github.com)
Best agent/model: Claude 3.5 Sonnet
Score: 96.7% · measured Sep 2025

Task: Web navigation
Benchmark: WebVoyager, 15 real-world websites. (↗ arxiv.org)
Best agent/model: GPT-4V + WebVoyager agent
Score: 87.4% · measured Jun 2025
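The pass@1 figure in the code generation entry follows the standard HumanEval convention. The commonly used unbiased estimator for pass@k (introduced with the HumanEval benchmark) can be sketched as follows; the function name is ours, not from any tracked system:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated per problem
    c: number of samples that pass the tests
    k: samples drawn; returns P(at least one of k passes)
    """
    if n - c < k:
        # Fewer failures than draws: some draw must succeed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 7 correct, pass@1 reduces to c/n:
print(pass_at_k(10, 7, 1))  # ≈ 0.7
```

Averaging this estimate over all problems in the suite gives the headline percentage; with k = 1 it is simply the fraction of single attempts that pass.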
Methodology

Every entry requires a public source URL. Scores are reported as published by benchmark authors — we do not normalize or adjust. Best score = the highest result published on the benchmark at time of entry, attributed to the model that achieved it. When benchmarks update, we add new entries rather than overwriting old ones to preserve the historical record.
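The append-only policy above can be sketched as a minimal data model. All names here are illustrative, not Agentifact's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkEntry:
    # Illustrative fields only.
    task: str
    agent: str
    score_pct: float   # as published; never normalized or adjusted
    measured: str      # e.g. "Dec 2025"
    source_url: str    # every entry requires a public source

class Tracker:
    """Append-only log: new results are added, old ones never overwritten."""

    def __init__(self) -> None:
        self._entries: list[BenchmarkEntry] = []

    def add(self, entry: BenchmarkEntry) -> None:
        if not entry.source_url:
            raise ValueError("every entry requires a public source URL")
        self._entries.append(entry)  # append, never mutate or replace

    def best(self, task: str) -> BenchmarkEntry:
        """Highest published score for a task at time of query."""
        return max((e for e in self._entries if e.task == task),
                   key=lambda e: e.score_pct)
```

Because `add` only ever appends, superseded results stay queryable, which is what preserves the historical record when a benchmark updates.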

Agentifact

The trust index for the agent economy. Every tool scored on agent-readiness, trust, interoperability, security, and documentation quality.

© 2026 Agentifact. Independent editorial. Scores verified against live infrastructure.