How we score tools
Every score on Agentifact is an editorial assessment. There is no algorithm generating these numbers from metadata. A human with direct experience testing, integrating, and breaking these tools makes the final call on every dimension.
Five dimensions, weighted
Each tool is scored across five dimensions. The weights reflect what matters most when you're building with agents in production — not in a demo.
“Can my agent actually use this without me babysitting it?”
The single most important dimension. A tool can be excellent for human developers and completely unusable for agents. This measures whether the tool works when no human is watching.
Measures: API reliability, structured output quality, error handling for agent consumption, latency under automated load, support for tool-calling patterns
| Score | What it means |
|---|---|
| 0-39 | No API, UI-only, or API is so unreliable that agents cannot depend on it. |
| 40-59 | Primarily a human UI tool with partial API. Agents can use it with significant effort. |
| 60-79 | API exists and works but not designed for autonomous use. Requires wrapper logic. |
| 80-89 | Strong API, mostly structured outputs, some async support. Minor manual config. |
| 90-100 | API-first design, structured JSON outputs, MCP or tool_use integration, <500ms median latency, zero manual steps. |
“Will this tool still work the same way next month, and is my data safe?”
Agent systems compound dependencies fast. If a tool ships breaking changes without warning, your whole pipeline goes down. Trust measures whether you can depend on this tool over time.
Measures: Uptime track record, data privacy posture, company maturity, funding stability, transparent incident history
| Score | What it means |
|---|---|
| 0-39 | Abandoned, significant security incidents, no privacy policy, or company dissolved. |
| 40-59 | Anonymous maintainer or recent pivot. No uptime data, unclear data handling. |
| 60-79 | Active but limited history (<1 year). Incomplete uptime data, basic privacy docs. |
| 80-89 | Identifiable team, documented uptime, clear privacy policy. Minor incidents handled transparently. |
| 90-100 | Verified company, 99.9%+ uptime SLA, public status page, SOC2 or equivalent, transparent incident postmortems. |
“Does this plug into the tools I already use, or am I building glue code?”
The best tool in isolation is useless if it can't talk to anything else. Interop scores how well a tool plays with the broader agent ecosystem — protocols, frameworks, and other services.
Measures: Protocol support (MCP, A2A, REST), verified integrations, webhook/event support, framework compatibility
| Score | What it means |
|---|---|
| 0-39 | Proprietary only, no API, no integrations. |
| 40-59 | Partial API, limited or no documented integrations. |
| 60-79 | REST API only, no agent-native protocols. Requires custom wrapper. |
| 80-89 | REST API + one agent protocol (MCP or A2A) + SDK. 5+ framework integrations. |
| 90-100 | MCP server + A2A + REST + webhooks + official SDKs in Python and TypeScript. 10+ verified integrations. |
“If my agent goes off-script, what stops it from doing damage?”
Agents make autonomous decisions. That means the blast radius of a security failure is larger than with human-operated tools. This dimension measures the guardrails, not just the locks on the front door.
Measures: Sandboxing/isolation, permission scoping, audit trails, governance controls, vulnerability history
| Score | What it means |
|---|---|
| 0-39 | Known unresolved vulnerabilities, no auth, or agent actions are completely uncontrolled. |
| 40-59 | Shared credentials or weak auth. No audit trail, no permission scoping. |
| 60-79 | Standard API key auth, no sandboxing, minimal audit capability. |
| 80-89 | Good auth (OAuth/API key scoping), partial audit trail, permission docs. No known critical CVEs. |
| 90-100 | Execution sandboxing (microVM), scoped permissions (least privilege), full audit trail, public CVE history with fast remediation. |
“When something breaks at 2am, can I figure out the fix from the docs?”
Documentation quality is the difference between a 10-minute fix and a 4-hour debugging session. We weight it lower than the operational dimensions, but bad docs will drag any tool's score down.
Measures: API reference completeness, working code examples, error code documentation, changelog currency
| Score | What it means |
|---|---|
| 0-39 | No meaningful documentation, or docs describe a completely different version. |
| 40-59 | Minimal docs, no runnable examples, changelog absent or years old. |
| 60-79 | Partial API docs, some examples (may be outdated), changelog sparse. |
| 80-89 | Good API reference, working examples in at least one language. Changelog current within 90 days. |
| 90-100 | Complete API reference, runnable examples in 2+ languages, documented error codes, up-to-date changelog, dedicated agent integration guide. |
Score ranges
The composite score is a weighted average of all five dimensions, rounded to the nearest integer.
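To make the arithmetic concrete, here's a minimal sketch of the weighted-average calculation. The dimension keys and weights below are illustrative placeholders, not Agentifact's published weights.

```python
# Sketch: composite score as a weighted average of the five dimensions,
# rounded to the nearest integer. The keys and weights are illustrative
# placeholders; the actual weights are not published in this document.
WEIGHTS = {
    "agent_usability": 0.30,
    "trust": 0.25,
    "interop": 0.20,
    "security": 0.15,
    "documentation": 0.10,
}

def composite_score(dimension_scores: dict[str, float]) -> int:
    total = sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)
    return round(total)

# A tool that is strong operationally but weaker on docs:
print(composite_score({
    "agent_usability": 92,
    "trust": 85,
    "interop": 80,
    "security": 90,
    "documentation": 70,
}))  # -> 85
```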
How scores are determined
Each score is an editorial assessment informed by direct testing — installing the tool, running it against real agent workflows, and deliberately trying to break it. We supplement our testing with builder community reports, public incident history, documentation audits, and integration testing across major frameworks.
Scores are re-verified on a rolling basis. The verification date is displayed on every tool card. If a tool ships a major update or has a significant incident, we re-score within 72 hours. Otherwise, full re-verification happens monthly.
What scores are not
- Not endorsements
- Not affiliate signals or pay-to-play rankings
- Not algorithmically generated from metadata
- Not influenced by vendor relationships, funding, or advertising
A tool with a score of 60 and a clear verdict about when to use it is more useful than a tool with a score of 90 and no context. The editorial verdict always carries more weight than the number.
Live example: E2B Code Interpreter
Here's how the methodology applies to a real tool. E2B scores 88 — Strong tier. Dimension by dimension:

- Structured output, fast cold starts, clean error responses. Agents can call and parse without custom handling.
- Well-funded, consistent uptime, transparent about incidents. API stability over the past 12 months has been excellent.
- MCP server available. Works with LangGraph, CrewAI, and most orchestrators. A2A support is pending.
- Full sandbox isolation is the core product. Each execution runs in a separate Firecracker microVM. Highest Security score in the index.
- Clear API reference, working code samples in Python and JS. Changelog is current. Error code docs could be more comprehensive.
Approval Control Mode
This field is separate from the trust score. It answers one question: “If I give this to my agent, will it run fully autonomously, or will it require human confirmation?”

| Mode | What it means |
|---|---|
| FULL_AUTO | Can be delegated to an AI agent with no human confirmation. |
| REQUIRES_APPROVAL | Requires at least one approval gate before execution. |
| HUMAN_IN_LOOP | Human involvement is required throughout the workflow. |
| Not applicable | Approval mode does not apply to this tool type. |
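For illustration, here's how an agent harness might branch on this field. Only the three mode strings come from the API; the policy labels returned are hypothetical.

```python
# Sketch: mapping a tool's approvalMode value to an execution policy.
# The mode strings match the API values; the returned labels are illustrative.
def execution_policy(approval_mode: str) -> str:
    if approval_mode == "FULL_AUTO":
        # Safe to delegate: no human confirmation required.
        return "execute"
    if approval_mode == "REQUIRES_APPROVAL":
        # At least one approval gate: pause and ask before executing.
        return "ask-first"
    if approval_mode == "HUMAN_IN_LOOP":
        # A human stays involved throughout: hand the task off.
        return "hand-off"
    raise ValueError(f"Unknown approval mode: {approval_mode}")

print(execution_policy("FULL_AUTO"))  # -> execute
```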
Machine-queryable endpoint
GET /api/tools

Query parameters:
| Parameter | Accepted values |
|---|---|
| category | MCP_SERVER \| HITL_PROVIDER \| A2A_AGENT \| FRAMEWORK |
| approvalMode | FULL_AUTO \| REQUIRES_APPROVAL \| HUMAN_IN_LOOP |
| minScore | 0-100 (minimum composite trust score) |
| mcp | true (only MCP-compatible tools) |
| workflow | booking \| data-processing \| customer-service \| ... |
| limit | 1-100 (default 20) |
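As an example of querying the endpoint, the snippet below asks for highly scored, fully autonomous MCP servers. The base URL and the response schema are assumptions; only the path and query parameters come from the table above.

```python
# Sketch: querying the listing API for MCP-compatible, fully autonomous tools.
# The base URL is a placeholder and the response is assumed to be a JSON list.
import requests

resp = requests.get(
    "https://agentifact.example/api/tools",  # placeholder base URL
    params={
        "category": "MCP_SERVER",
        "approvalMode": "FULL_AUTO",
        "minScore": 80,
        "mcp": "true",
        "limit": 10,
    },
    timeout=10,
)
resp.raise_for_status()
for tool in resp.json():
    print(tool)
```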
Stale policy
Listings are automatically flagged Stale when more than 30 days have passed since the last verified date. We re-verify stale listings on a rolling schedule, highest traffic first. If a tool ships a major update or has a significant incident, we re-score within 72 hours.
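The staleness check itself is simple date arithmetic. A minimal sketch, assuming a listing exposes its last verified date:

```python
# Sketch: flag a listing as Stale when more than 30 days have passed
# since its last verified date. Variable names are illustrative.
from datetime import date, timedelta

STALE_AFTER = timedelta(days=30)

def is_stale(last_verified: date, today: date | None = None) -> bool:
    today = today or date.today()
    return today - last_verified > STALE_AFTER

print(is_stale(date(2024, 1, 1), today=date(2024, 2, 15)))  # -> True
```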