Agentifact

Category

Eval & Testing

12 tools | Avg score 71

LLM evaluation frameworks, agent benchmarks, quality gates, and adversarial test suites for validating production agent behavior.

We only list tools that meet minimum quality standards.

Fiddler AI

Trust score: 82

Enterprise ML monitoring and explainability platform covering model performance, data drift, bias detection, and LLM evaluation. Provides a unified dashboard for tracking production models across traditional ML and generative AI. Offers NLP explainability and alert routing integrations.

AGENT 92 | TRUST 85 | INTEROP 72 | SECURE 75 | DOCS 85
Verified Apr 2026 | REST

Datadog LLM Observability

NEEDS APPROVAL
Trust score: 80

Datadog's LLM Observability product monitors AI application performance, traces LLM calls end-to-end, and evaluates output quality in production. Integrates natively with OpenAI, Anthropic, and major frameworks like LangChain. Provides latency dashboards, token cost tracking, and automated quality evaluations.

AGENT 85 | TRUST 92 | INTEROP 75 | SECURE 75 | DOCS 72
automatic activation of LLM observability charges without explicit opt-in when OpenTelemetry GenAI semantic conventions are detected; cost estimation displays as NA for non-OpenAI models like Gemini; +1 more
Verified Apr 2026

Weights & Biases

NEEDS APPROVAL
Trust score: 78

MLOps platform for experiment tracking, model evaluation, and dataset versioning. W&B Weave provides LLM-specific tracing, evaluation frameworks, and dataset management for agent pipelines. The platform is widely used by ML teams.

AGENT 85 | TRUST 92 | INTEROP 65 | SECURE 72 | DOCS 75
Verified Apr 2026 | REST

Comet ML

NEEDS APPROVAL
Trust score: 76

ML experiment tracking and model management platform with built-in LLM evaluation via Comet Opik. Logs metrics, hyperparameters, and artifacts during training; the Opik module handles prompt versioning and LLM output quality scoring. Integrates with PyTorch, TensorFlow, and popular agent frameworks.

AGENT 85 | TRUST 82 | INTEROP 85 | SECURE 65 | DOCS 65
Verified Apr 2026 | MCP

Galileo AI

NEEDS APPROVAL
Trust score: 71

LLM evaluation and monitoring platform with automated hallucination detection, response quality scoring, and production drift alerts. Provides a data flywheel for continuous prompt improvement using evaluation feedback loops.

AGENT 85 | TRUST 85 | INTEROP 72 | SECURE 72 | DOCS 40
Verified Apr 2026 | REST

Cleanlab

FULL AUTO
Trust score: 70

Data-centric AI platform that automatically detects label errors and data quality issues and computes trustworthiness scores for ML datasets and LLM outputs. Provides the open-source cleanlab library plus a hosted Studio for teams. Particularly effective for improving training data quality before fine-tuning.

AGENT 72 | TRUST 65 | INTEROP 65 | SECURE 65 | DOCS 85
Verified Apr 2026

Logfire

Trust score: 70

OpenTelemetry-native observability platform from the Pydantic team, designed for Python AI applications and LLM agents. Provides structured logging, distributed traces, and dashboards with first-class support for Pydantic AI, FastAPI, and HTTPX. Simple SDK with one-line setup for agent call tracing.

AGENT 65 | TRUST 65 | INTEROP 72 | SECURE 85 | DOCS 65
Verified Apr 2026 | REST

New Relic AI Monitoring

Trust score: 70

AI Monitoring in New Relic APM traces LLM calls end-to-end and tracks token consumption, response times, and model costs alongside application performance. Auto-instruments OpenAI, Anthropic, and Bedrock integrations.

AGENT 45 | TRUST 85 | INTEROP 85 | SECURE 70 | DOCS 65
Verified Apr 2026 | MCP

Evidently AI

FULL AUTO
Trust score: 69

Open-source ML monitoring library for detecting data drift, model degradation, and LLM output quality changes. Generates visual reports and dashboards for production model health. Integrates with any ML stack via Python SDK.

AGENT 72 | TRUST 72 | INTEROP 75 | SECURE 60 | DOCS 65
Verified Apr 2026 | REST

WhyLabs

Trust score: 67

AI observability platform that monitors ML models and LLM applications for data drift, hallucinations, and policy violations in real time. Uses lightweight statistical profiling (whylogs) to capture data quality metrics without storing raw inputs. Supports Python SDK integration and configurable alerting.

AGENT 65 | TRUST 65 | INTEROP 75 | SECURE 65 | DOCS 65
Verified Apr 2026 | REST

Arthur AI

NEEDS APPROVAL
Trust score: 64

Enterprise ML monitoring and LLM evaluation platform. Provides real-time hallucination detection, toxicity monitoring, bias analysis, and performance dashboards for production AI systems with enterprise compliance requirements.

AGENT 75 | TRUST 75 | INTEROP 45 | SECURE 85 | DOCS 40
Verified Apr 2026 | REST

Grafana LLM Observability

Trust score: 60

Grafana plugin for LLM application metrics. Integrates with OpenTelemetry traces from LangChain, OpenAI, and Anthropic to visualize token usage, latency histograms, error rates, and cost trends in Grafana dashboards.

AGENT 35 | TRUST 85 | INTEROP 45 | SECURE 72 | DOCS 65
Verified Apr 2026 | REST

The trust index for the agent economy. Every tool scored on agent-readiness, trust, interoperability, security, and documentation quality.

© 2026 Agentifact. Independent editorial. Scores verified against live infrastructure.