How we score tools
Every score on Agentifact is an editorial assessment. There is no algorithm generating these numbers from metadata. A human with direct experience testing, integrating, and breaking these tools makes the final call on every dimension.
Five dimensions, weighted
Each tool is scored across five dimensions. The weights reflect what matters most when you're building with agents in production — not in a demo.
“Can my agent actually use this without me babysitting it?”
The single most important dimension. A tool can be excellent for human developers and completely unusable for agents. This measures whether the tool works when no human is watching.
Measures: API reliability, structured output quality, error handling for agent consumption, latency under automated load, support for tool-calling patterns
| Score | What it means |
|---|---|
| 0-39 | No API, UI-only, or API is so unreliable that agents cannot depend on it. |
| 40-59 | Primarily a human UI tool with partial API. Agents can use it with significant effort. |
| 60-79 | API exists and works but not designed for autonomous use. Requires wrapper logic. |
| 80-89 | Strong API, mostly structured outputs, some async support. Minor manual config. |
| 90-100 | API-first design, structured JSON outputs, MCP or tool_use integration, <500ms median latency, zero manual steps. |
“Will this tool still work the same way next month, and is my data safe?”
Agent systems compound dependencies fast. If a tool ships breaking changes without warning, your whole pipeline goes down. Trust measures whether you can depend on this tool over time.
Measures: Uptime track record, data privacy posture, company maturity, funding stability, transparent incident history
| Score | What it means |
|---|---|
| 0-39 | Abandoned, significant security incidents, no privacy policy, or company dissolved. |
| 40-59 | Anonymous maintainer or recent pivot. No uptime data, unclear data handling. |
| 60-79 | Active but limited history (<1 year). Incomplete uptime data, basic privacy docs. |
| 80-89 | Identifiable team, documented uptime, clear privacy policy. Minor incidents handled transparently. |
| 90-100 | Verified company, 99.9%+ uptime SLA, public status page, SOC2 or equivalent, transparent incident postmortems. |
“Does this plug into the tools I already use, or am I building glue code?”
The best tool in isolation is useless if it can't talk to anything else. Interop scores how well a tool plays with the broader agent ecosystem — protocols, frameworks, and other services.
Measures: Protocol support (MCP, A2A, REST), verified integrations, webhook/event support, framework compatibility
| Score | What it means |
|---|---|
| 0-39 | Proprietary only, no API, no integrations. |
| 40-59 | Partial API, limited or no documented integrations. |
| 60-79 | REST API only, no agent-native protocols. Requires custom wrapper. |
| 80-89 | REST API + one agent protocol (MCP or A2A) + SDK. 5+ framework integrations. |
| 90-100 | MCP server + A2A + REST + webhooks + official SDKs in Python and TypeScript. 10+ verified integrations. |
“If my agent goes off-script, what stops it from doing damage?”
Agents make autonomous decisions. That means the blast radius of a security failure is larger than with human-operated tools. This dimension measures the guardrails, not just the locks on the front door.
Measures: Sandboxing/isolation, permission scoping, audit trails, governance controls, vulnerability history
| Score | What it means |
|---|---|
| 0-39 | Known unresolved vulnerabilities, no auth, or agent actions are completely uncontrolled. |
| 40-59 | Shared credentials or weak auth. No audit trail, no permission scoping. |
| 60-79 | Standard API key auth, no sandboxing, minimal audit capability. |
| 80-89 | Good auth (OAuth/API key scoping), partial audit trail, permission docs. No known critical CVEs. |
| 90-100 | Execution sandboxing (microVM), scoped permissions (least privilege), full audit trail, public CVE history with fast remediation. |
“When something breaks at 2am, can I figure out the fix from the docs?”
Documentation quality is the difference between a 10-minute fix and a 4-hour debugging session. We weight it lower than the operational dimensions, but bad docs will drag any tool's score down.
Measures: API reference completeness, working code examples, error code documentation, changelog currency
| Score | What it means |
|---|---|
| 0-39 | No meaningful documentation, or docs describe a completely different version. |
| 40-59 | Minimal docs, no runnable examples, changelog absent or years old. |
| 60-79 | Partial API docs, some examples (may be outdated), changelog sparse. |
| 80-89 | Good API reference, working examples in at least one language. Changelog current within 90 days. |
| 90-100 | Complete API reference, runnable examples in 2+ languages, documented error codes, up-to-date changelog, dedicated agent integration guide. |
Score ranges
The composite score is a weighted average of all five dimensions, rounded to the nearest integer.
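To make the arithmetic concrete, here's a minimal sketch of the weighted-average calculation. The dimension keys and weights below are illustrative placeholders, not Agentifact's published weights.

```python
# Sketch: composite score as a weighted average of the five dimensions,
# rounded to the nearest integer. The keys and weights are illustrative
# placeholders; the actual weights are not published in this document.
WEIGHTS = {
    "agent_usability": 0.30,
    "trust": 0.25,
    "interop": 0.20,
    "security": 0.15,
    "documentation": 0.10,
}

def composite_score(dimension_scores: dict[str, float]) -> int:
    total = sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)
    return round(total)

# A tool that is strong operationally but weaker on docs:
print(composite_score({
    "agent_usability": 92,
    "trust": 85,
    "interop": 80,
    "security": 90,
    "documentation": 70,
}))  # -> 85
```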
How scores are determined
Each score is an editorial assessment informed by direct testing — installing the tool, running it against real agent workflows, and deliberately trying to break it. We supplement our testing with builder community reports, public incident history, documentation audits, and integration testing across major frameworks.
Scores are re-verified on a rolling basis. The verification date is displayed on every tool card. If a tool ships a major update or has a significant incident, we re-score within 72 hours. Otherwise, full re-verification happens monthly.
What scores are not
- Not endorsements
- Not affiliate signals or pay-to-play rankings
- Not algorithmically generated from metadata
- Not influenced by vendor relationships, funding, or advertising
A tool with a score of 60 and a clear verdict about when to use it is more useful than a tool with a score of 90 and no context. The editorial verdict always carries more weight than the number.
Live example: E2B Code Interpreter
Here's how the methodology applies to a real tool. E2B scores 88 — Strong tier. Dimension by dimension:

- Structured output, fast cold starts, clean error responses. Agents can call and parse without custom handling.
- Well-funded, consistent uptime, transparent about incidents. API stability over the past 12 months has been excellent.
- MCP server available. Works with LangGraph, CrewAI, and most orchestrators. A2A support is pending.
- Full sandbox isolation is the core product. Each execution runs in a separate Firecracker microVM. Highest Security score in the index.
- Clear API reference, working code samples in Python and JS. Changelog is current. Error code docs could be more comprehensive.
Approval Control Mode
This field is separate from the trust score. It answers one question: “If I give this to my agent, will it run fully autonomously, or will it require human confirmation?”

| Mode | What it means |
|---|---|
| FULL_AUTO | Can be delegated to an AI agent with no human confirmation. |
| REQUIRES_APPROVAL | Requires at least one approval gate before execution. |
| HUMAN_IN_LOOP | Human involvement is required throughout the workflow. |
| Not applicable | Approval mode does not apply to this tool type. |
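For illustration, here's how an agent harness might branch on this field. Only the three mode strings come from the API; the policy labels returned are hypothetical.

```python
# Sketch: mapping a tool's approvalMode value to an execution policy.
# The mode strings match the API values; the returned labels are illustrative.
def execution_policy(approval_mode: str) -> str:
    if approval_mode == "FULL_AUTO":
        # Safe to delegate: no human confirmation required.
        return "execute"
    if approval_mode == "REQUIRES_APPROVAL":
        # At least one approval gate: pause and ask before executing.
        return "ask-first"
    if approval_mode == "HUMAN_IN_LOOP":
        # A human stays involved throughout: hand the task off.
        return "hand-off"
    raise ValueError(f"Unknown approval mode: {approval_mode}")

print(execution_policy("FULL_AUTO"))  # -> execute
```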
Machine-queryable endpoint
GET /api/tools

Query parameters:
| Parameter | Accepted values |
|---|---|
| category | MCP_SERVER \| HITL_PROVIDER \| A2A_AGENT \| FRAMEWORK |
| approvalMode | FULL_AUTO \| REQUIRES_APPROVAL \| HUMAN_IN_LOOP |
| minScore | 0-100 (minimum composite trust score) |
| mcp | true (only MCP-compatible tools) |
| workflow | booking \| data-processing \| customer-service \| ... |
| limit | 1-100 (default 20) |
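As an example of querying the endpoint, the snippet below asks for highly scored, fully autonomous MCP servers. The base URL and the response schema are assumptions; only the path and query parameters come from the table above.

```python
# Sketch: querying the listing API for MCP-compatible, fully autonomous tools.
# The base URL is a placeholder and the response is assumed to be a JSON list.
import requests

resp = requests.get(
    "https://agentifact.example/api/tools",  # placeholder base URL
    params={
        "category": "MCP_SERVER",
        "approvalMode": "FULL_AUTO",
        "minScore": 80,
        "mcp": "true",
        "limit": 10,
    },
    timeout=10,
)
resp.raise_for_status()
for tool in resp.json():
    print(tool)
```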
Stale policy
Listings are automatically flagged Stale when more than 30 days have passed since the last verified date. We re-verify stale listings on a rolling schedule, highest traffic first. If a tool ships a major update or has a significant incident, we re-score within 72 hours.
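The staleness check itself is simple date arithmetic. A minimal sketch, assuming a listing exposes its last verified date:

```python
# Sketch: flag a listing as Stale when more than 30 days have passed
# since its last verified date. Variable names are illustrative.
from datetime import date, timedelta

STALE_AFTER = timedelta(days=30)

def is_stale(last_verified: date, today: date | None = None) -> bool:
    today = today or date.today()
    return today - last_verified > STALE_AFTER

print(is_stale(date(2024, 1, 1), today=date(2024, 2, 15)))  # -> True
```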