
How we score tools

Every score on Agentifact is an editorial assessment. There is no algorithm generating these numbers from metadata. A human with direct experience testing, integrating, and breaking these tools makes the final call on every dimension.

Five dimensions, weighted

Each tool is scored across five dimensions. The weights reflect what matters most when you're building with agents in production — not in a demo.

Agent Readiness (25% weight)

“Can my agent actually use this without me babysitting it?”

The single most important dimension. A tool can be excellent for human developers and completely unusable for agents. This measures whether the tool works when no human is watching.

Measures: API reliability, structured output quality, error handling for agent consumption, latency under automated load, support for tool-calling patterns

0–39: No API, UI-only, or API is so unreliable that agents cannot depend on it.
40–59: Primarily a human UI tool with partial API. Agents can use it with significant effort.
60–79: API exists and works but not designed for autonomous use. Requires wrapper logic.
80–89: Strong API, mostly structured outputs, some async support. Minor manual config.
90–100: API-first design, structured JSON outputs, MCP or tool_use integration, <500ms median latency, zero manual steps.
Trust (20% weight)

“Will this tool still work the same way next month, and is my data safe?”

Agent systems compound dependencies fast. If a tool ships breaking changes without warning, your whole pipeline goes down. Trust measures whether you can depend on this tool over time.

Measures: Uptime track record, data privacy posture, company maturity, funding stability, transparent incident history

0–39: Abandoned, significant security incidents, no privacy policy, or company dissolved.
40–59: Anonymous maintainer or recent pivot. No uptime data, unclear data handling.
60–79: Active but limited history (<1 year). Incomplete uptime data, basic privacy docs.
80–89: Identifiable team, documented uptime, clear privacy policy. Minor incidents handled transparently.
90–100: Verified company, 99.9%+ uptime SLA, public status page, SOC2 or equivalent, transparent incident postmortems.
Interop (20% weight)

“Does this plug into the tools I already use, or am I building glue code?”

The best tool in isolation is useless if it can't talk to anything else. Interop scores how well a tool plays with the broader agent ecosystem — protocols, frameworks, and other services.

Measures: Protocol support (MCP, A2A, REST), verified integrations, webhook/event support, framework compatibility

0–39: Proprietary only, no API, no integrations.
40–59: Partial API, limited or no documented integrations.
60–79: REST API only, no agent-native protocols. Requires custom wrapper.
80–89: REST API + one agent protocol (MCP or A2A) + SDK. 5+ framework integrations.
90–100: MCP server + A2A + REST + webhooks + official SDKs in Python and TypeScript. 10+ verified integrations.
Security (20% weight)

“If my agent goes off-script, what stops it from doing damage?”

Agents make autonomous decisions. That means the blast radius of a security failure is larger than with human-operated tools. This dimension measures the guardrails, not just the locks on the front door.

Measures: Sandboxing/isolation, permission scoping, audit trails, governance controls, vulnerability history

0–39: Known unresolved vulnerabilities, no auth, or agent actions are completely uncontrolled.
40–59: Shared credentials or weak auth. No audit trail, no permission scoping.
60–79: Standard API key auth, no sandboxing, minimal audit capability.
80–89: Good auth (OAuth/API key scoping), partial audit trail, permission docs. No known critical CVEs.
90–100: Execution sandboxing (microVM), scoped permissions (least privilege), full audit trail, public CVE history with fast remediation.
Docs (15% weight)

“When something breaks at 2am, can I figure out the fix from the docs?”

Documentation quality is the difference between a 10-minute fix and a 4-hour debugging session. We weight it lower than the operational dimensions, but bad docs will drag any tool's score down.

Measures: API reference completeness, working code examples, error code documentation, changelog currency

0–39: No meaningful documentation, or docs describe a completely different version.
40–59: Minimal docs, no runnable examples, changelog absent or years old.
60–79: Partial API docs, some examples (may be outdated), changelog sparse.
80–89: Good API reference, working examples in at least one language. Changelog current within 90 days.
90–100: Complete API reference, runnable examples in 2+ languages, documented error codes, up-to-date changelog, dedicated agent integration guide.

Score ranges

The composite score is a weighted average of all five dimensions, rounded to the nearest integer. Here's what the ranges mean.
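As a concrete sketch, the weighted average described above can be computed like this (the dimension keys and example scores are illustrative, not from any real listing):

```python
# Composite score: each dimension score (0-100) multiplied by its weight,
# summed, then rounded to the nearest integer. Weights match the five
# dimensions above and sum to 1.0.
WEIGHTS = {
    "agent_readiness": 0.25,
    "trust": 0.20,
    "interop": 0.20,
    "security": 0.20,
    "docs": 0.15,
}

def composite(scores: dict) -> int:
    """Weighted average of the five dimension scores, rounded to an integer."""
    return round(sum(scores[dim] * weight for dim, weight in WEIGHTS.items()))

example = {
    "agent_readiness": 80,
    "trust": 70,
    "interop": 60,
    "security": 75,
    "docs": 65,
}
print(composite(example))  # 70.75 rounds to 71
```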

90–100 (Verified): Consistently reliable across all dimensions. Rare.
80–89 (Strong): Trusted by the builder community for production agent systems.
60–79 (Solid): Works for most agent workflows. Known limitations are manageable.
40–59 (Caution): Usable with significant caveats. Document the risks before deploying.
0–39 (Risk): Not recommended for agent use. Major gaps in one or more dimensions.

How scores are determined

Each score is an editorial assessment informed by direct testing — installing the tool, running it against real agent workflows, and deliberately trying to break it. We supplement our testing with builder community reports, public incident history, documentation audits, and integration testing across major frameworks.

Scores are re-verified on a rolling basis. The verification date is displayed on every tool card. If a tool ships a major update or has a significant incident, we re-score within 72 hours. Otherwise, full re-verification happens monthly.

What scores are not

  • Not endorsements
  • Not affiliate signals or pay-to-play rankings
  • Not algorithmically generated from metadata
  • Not influenced by vendor relationships, funding, or advertising

A tool with a score of 60 and a clear verdict about when to use it is more useful than a tool with a score of 90 and no context. The editorial verdict always carries more weight than the number.

Live example: E2B Code Interpreter

Here's how the methodology applies to a real tool. E2B scores 88 — Strong tier.

E2B Code Interpreter: 88 (Strong)
Agent Readiness (25% weight): 90

Structured output, fast cold starts, clean error responses. Agents can call and parse without custom handling.

Trust (20% weight): 88

Well-funded, consistent uptime, transparent about incidents. API stability over the past 12 months is excellent.

Interop (20% weight): 84

MCP server available. Works with LangGraph, CrewAI, and most orchestrators. A2A support is pending.

Security (20% weight): 91

Full sandbox isolation is the core product. Each execution runs in a separate Firecracker microVM. Highest Security score in the index.

Docs (15% weight): 86

Clear API reference, working code samples in Python and JS. Changelog is current. Error code docs could be more comprehensive.
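As a sanity check on the weighting, E2B's five dimension scores reproduce its composite when plugged into the published weights (a sketch using only numbers shown above):

```python
# Order: Agent Readiness, Trust, Interop, Security, Docs
weights = [0.25, 0.20, 0.20, 0.20, 0.15]
scores = [90, 88, 84, 91, 86]

# Weighted average, rounded to the nearest integer per the methodology.
composite = round(sum(w * s for w, s in zip(weights, scores)))
print(composite)  # 88, matching the Strong-tier score on the card
```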

Approval Control Mode

This field is separate from the trust score. It answers: “If I give this to my agent, will it run fully autonomously, or require human confirmation?”

FULL AUTO: Can be delegated to an AI agent with no human confirmation.

NEEDS APPROVAL: Requires at least one approval gate before execution.

HUMAN IN LOOP: Human involvement is required throughout the workflow.

N/A: Approval mode is not applicable for this tool type.
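A minimal sketch of how client code might represent the four modes. The string values for the first three mirror the API's approvalMode parameter listed below; the "N/A" value is an assumption, since the API section does not document it:

```python
from enum import Enum

class ApprovalMode(Enum):
    """Approval Control Mode. The first three values match the API's
    approvalMode query parameter; NOT_APPLICABLE is an assumed value."""
    FULL_AUTO = "FULL_AUTO"                  # no human confirmation needed
    REQUIRES_APPROVAL = "REQUIRES_APPROVAL"  # at least one approval gate
    HUMAN_IN_LOOP = "HUMAN_IN_LOOP"          # human involved throughout
    NOT_APPLICABLE = "N/A"                   # mode does not apply

print(ApprovalMode("HUMAN_IN_LOOP").name)
```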

Machine-queryable endpoint

GET /api/tools

Query parameters:

category: MCP_SERVER | HITL_PROVIDER | A2A_AGENT | FRAMEWORK
approvalMode: FULL_AUTO | REQUIRES_APPROVAL | HUMAN_IN_LOOP
minScore: 0–100 — minimum composite trust score
mcp: true — only MCP-compatible tools
workflow: booking | data-processing | customer-service | ...
limit: 1–100 (default 20)
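The parameters above compose into a standard query string. A sketch using Python's standard library; the host is a placeholder, since the listing only specifies the /api/tools path:

```python
from urllib.parse import urlencode

# Placeholder host; only the /api/tools path and the parameter names
# come from the endpoint listing above.
BASE = "https://example.com/api/tools"

# Fetch MCP-compatible tools that run fully autonomously with a
# composite score of at least 80.
params = {
    "category": "MCP_SERVER",
    "approvalMode": "FULL_AUTO",
    "minScore": 80,
    "mcp": "true",
    "limit": 20,
}
url = f"{BASE}?{urlencode(params)}"
print(url)
```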

Stale policy

Listings are automatically flagged Stale when more than 30 days have passed since the last verified date. We re-verify stale listings on a rolling schedule, highest traffic first.
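The 30-day cutoff reduces to a simple date comparison. A sketch with an assumed helper, not Agentifact's own code:

```python
from datetime import date, timedelta

STALE_AFTER = timedelta(days=30)

def is_stale(last_verified: date, today: date) -> bool:
    """Flag a listing Stale when more than 30 days have passed
    since its last verified date."""
    return today - last_verified > STALE_AFTER

print(is_stale(date(2026, 1, 1), date(2026, 1, 31)))  # exactly 30 days: False
print(is_stale(date(2026, 1, 1), date(2026, 2, 5)))   # 35 days: True
```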

© 2026 Agentifact. Independent editorial. Scores verified against live infrastructure.