
Agent Benchmark Tracker

Task-level performance snapshots across autonomous agent systems. Every entry is sourced and dated — this is a historical record, not a leaderboard.

5 tasks tracked · updated as results are published

Each entry lists the task, the benchmark and scoring rule, the best agent/model, its score, and when the result was measured.

Task: Software engineering
Benchmark: SWE-bench Verified subset. Score = % of issues resolved end-to-end. (↗ swebench.com)
Best agent/model: Devin 2.0
Score: 70.3% · measured Dec 2025

Task: Long-horizon planning
Benchmark: τ-bench (tau-bench), retail and airline domains. Measures task completion in long agentic runs. (↗ github.com)
Best agent/model: Claude 3.5 Sonnet + τ-bench setup
Score: 54.2% · measured Nov 2025

Task: Research synthesis
Benchmark: GAIA, Levels 1-3. Score = % of questions answered correctly. (↗ huggingface.co)
Best agent/model: Gemini 2.0 + GAIA agent
Score: 75.9% · measured Oct 2025

Task: Code generation
Benchmark: HumanEval, pass@1 score. (↗ github.com)
Best agent/model: Claude 3.5 Sonnet
Score: 96.7% · measured Sep 2025

Task: Web navigation
Benchmark: WebVoyager, 15 real-world websites. (↗ arxiv.org)
Best agent/model: GPT-4V + WebVoyager agent
Score: 87.4% · measured Jun 2025
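The pass@1 figure in the code generation entry follows the standard HumanEval convention. The commonly used unbiased estimator for pass@k (introduced with the HumanEval benchmark) can be sketched as follows; the function name is ours, not from any tracked system:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated per problem
    c: number of samples that pass the tests
    k: samples drawn; returns P(at least one of k passes)
    """
    if n - c < k:
        # Fewer failures than draws: some draw must succeed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 7 correct, pass@1 reduces to c/n:
print(pass_at_k(10, 7, 1))  # ≈ 0.7
```

Averaging this estimate over all problems in the suite gives the headline percentage; with k = 1 it is simply the fraction of single attempts that pass.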
Methodology

Every entry requires a public source URL. Scores are reported as published by benchmark authors — we do not normalize or adjust. Best score = the highest result published on the benchmark at time of entry, attributed to the model that achieved it. When benchmarks update, we add new entries rather than overwriting old ones to preserve the historical record.
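The append-only policy above can be sketched as a minimal data model. All names here are illustrative, not Agentifact's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkEntry:
    # Illustrative fields only.
    task: str
    agent: str
    score_pct: float   # as published; never normalized or adjusted
    measured: str      # e.g. "Dec 2025"
    source_url: str    # every entry requires a public source

class Tracker:
    """Append-only log: new results are added, old ones never overwritten."""

    def __init__(self) -> None:
        self._entries: list[BenchmarkEntry] = []

    def add(self, entry: BenchmarkEntry) -> None:
        if not entry.source_url:
            raise ValueError("every entry requires a public source URL")
        self._entries.append(entry)  # append, never mutate or replace

    def best(self, task: str) -> BenchmarkEntry:
        """Highest published score for a task at time of query."""
        return max((e for e in self._entries if e.task == task),
                   key=lambda e: e.score_pct)
```

Because `add` only ever appends, superseded results stay queryable, which is what preserves the historical record when a benchmark updates.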

Agentifact

The trust index for the agent economy. Every tool scored on agent-readiness, trust, interoperability, security, and documentation quality.

© 2026 Agentifact. Independent editorial. Scores verified against live infrastructure.