Eval & Testing — Agentifact

12 tools

Fiddler AI

82

Trust score

Enterprise ML monitoring and explainability platform covering model performance, data drift, bias detection, and LLM evaluation. Provides a unified dashboard for tracking production models across traditional ML and generative AI. Offers NLP explainability and alert routing integrations.

Datadog LLM Observability

NEEDS APPROVAL

80

Trust score

Datadog's LLM Observability product monitors AI application performance, traces LLM calls end-to-end, and evaluates output quality in production. Integrates natively with OpenAI, Anthropic, and major frameworks like LangChain. Provides latency dashboards, token cost tracking, and automated quality evaluations.

automatic activation of LLM observability charges without explicit opt-in when OpenTelemetry GenAI semantic conventions are detected

cost estimation displays as NA for non-OpenAI models like Gemini

Stale · Apr 2026

Weights & Biases

NEEDS APPROVAL

78

Trust score

MLOps platform for experiment tracking, model evaluation, and dataset versioning. W&B Weave provides LLM-specific tracing, evaluation frameworks, and dataset management for agent pipelines.

Comet ML

NEEDS APPROVAL

76

Trust score

ML experiment tracking and model management platform with built-in LLM evaluation via Comet Opik. Logs metrics, hyperparameters, and artifacts during training; the Opik module handles prompt versioning and LLM output quality scoring. Integrates with PyTorch, TensorFlow, and popular agent frameworks.

Galileo AI

NEEDS APPROVAL

71

Trust score

LLM evaluation and monitoring platform with automated hallucination detection, response quality scoring, and production drift alerts. Provides a data flywheel for continuous prompt improvement, feeding evaluation results back into prompt revisions.

Cleanlab

FULL AUTO

70

Trust score

Data-centric AI platform that automatically detects label errors and data quality issues, and computes trustworthiness scores, for ML datasets and LLM outputs. Provides the open-source cleanlab library plus a hosted Studio for teams. Particularly effective for improving training data quality before fine-tuning.

Logfire

70

Trust score

OpenTelemetry-native observability platform from the Pydantic team, designed for Python AI applications and LLM agents. Provides structured logging, distributed traces, and dashboards with first-class support for Pydantic AI, FastAPI, and HTTPX. Simple SDK with one-line setup for agent call tracing.

New Relic AI Monitoring

70

Trust score

AI Monitoring in New Relic APM that traces LLM calls end-to-end and tracks token consumption, response times, and model costs alongside application performance. Auto-instruments OpenAI, Anthropic, and Bedrock integrations.

Evidently AI

FULL AUTO

69

Trust score

Open-source ML monitoring library for detecting data drift, model degradation, and LLM output quality changes. Generates visual reports and dashboards for production model health. Integrates with any ML stack via Python SDK.

WhyLabs

67

Trust score

AI observability platform that monitors ML models and LLM applications for data drift, hallucinations, and policy violations in real time. Uses lightweight statistical profiling (whylogs) to capture data quality metrics without storing raw inputs. Supports Python SDK integration and configurable alerting.
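
The idea of profiling without storing raw inputs can be illustrated with a minimal sketch. This is not the whylogs API; `ColumnProfile` and `profile_stream` are hypothetical names, and the choice of statistics (count, missing count, mean, standard deviation, min, max) is an assumption made for illustration. Each value is folded into running aggregates and then discarded, so only the summary remains.

```python
import math
from dataclasses import dataclass


@dataclass
class ColumnProfile:
    """Running summary of one numeric column; raw values are never kept.

    Hypothetical illustration only, not the whylogs data model.
    """
    count: int = 0
    missing: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations (Welford's method)
    minimum: float = math.inf
    maximum: float = -math.inf

    def track(self, value):
        """Fold one observation into the summary, then let it go."""
        if value is None:
            self.missing += 1
            return
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def stddev(self):
        """Sample standard deviation; 0.0 until two values have been seen."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.count - 1))


def profile_stream(values):
    """Build a profile from any iterable without materializing it."""
    profile = ColumnProfile()
    for value in values:
        profile.track(value)
    return profile
```

A profile built this way has constant size regardless of how many rows pass through it, which is what makes it cheap to ship to a monitoring backend. For example, `profile_stream([1.0, 2.0, 3.0, None, 4.0])` yields a count of 4, one missing value, and a mean of 2.5.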

Arthur AI

NEEDS APPROVAL

64

Trust score

Enterprise ML monitoring and LLM evaluation platform. Provides real-time hallucination detection, toxicity monitoring, bias analysis, and performance dashboards for production AI systems with enterprise compliance requirements.

Grafana LLM Observability

60

Trust score

Grafana plugin for LLM application metrics. Integrates with OpenTelemetry traces from LangChain, OpenAI, and Anthropic, and visualizes token usage, latency histograms, error rates, and cost trends in Grafana dashboards.