Building an Agent Observability Stack That Actually Helps You Debug
Logs aren't enough. Traces aren't enough. Here's the full picture.
The debugging nightmare
A user reports: "The agent gave me wrong information about our product pricing." You have 5 minutes to figure out why. Where do you look?
If your answer is "the logs," you've already lost. Application logs show you what happened. Agent observability shows you why it happened — which prompt was used, what context was injected, which tools were called, what the LLM was "thinking," and where the chain broke down.
The three layers
Layer 1: Trace-level observability
A trace captures the full lifecycle of an agent run, from user input to final output. It includes every LLM call, every tool invocation, every state transition, and every intermediate result.
What you need to capture:
- Input prompt (including system message, user message, and injected context)
- Model used and parameters (temperature, max tokens)
- Output token count and latency
- Tool calls: name, arguments, response, duration
- Errors: type, message, retry count
- Cost per call (computed from token counts and model pricing)
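The capture list above can be sketched as a concrete record type. This is a minimal illustration, not any vendor's actual schema: the field names, the demo model name, and the per-million-token prices are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ToolCall:
    name: str
    arguments: dict
    response: str
    duration_ms: float

@dataclass
class LLMSpan:
    # One LLM call inside a trace: prompt parameters, token counts, latency.
    model: str
    temperature: float
    max_tokens: int
    input_tokens: int
    output_tokens: int
    latency_ms: float
    tool_calls: list = field(default_factory=list)
    error: Optional[str] = None
    retry_count: int = 0

    def cost_usd(self, usd_per_m_input: float, usd_per_m_output: float) -> float:
        # Cost per call, computed from token counts and per-million-token pricing.
        return (self.input_tokens * usd_per_m_input
                + self.output_tokens * usd_per_m_output) / 1_000_000

span = LLMSpan(model="example-model", temperature=0.2, max_tokens=1024,
               input_tokens=3000, output_tokens=500, latency_ms=1800.0)
print(f"${span.cost_usd(2.50, 10.00):.4f}")  # (3000*2.50 + 500*10.00) / 1e6
```

Storing spans in a shape like this (rather than free-text log lines) is what makes the aggregate layer below cheap to build on top.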
Tools: Langfuse, LangSmith, and Arize Phoenix all provide trace-level views. They differ in pricing, retention, and ecosystem integration, but the core capability is similar.
The key question a trace answers: "For this specific run, what did the agent see, think, and do at each step?"
Layer 2: Aggregate monitoring
Individual traces tell you about individual runs. Aggregate monitoring tells you about your system's health over time.
Metrics to track:
- Success rate: What percentage of agent runs complete successfully? (Define "success" — this is harder than it sounds.)
- Latency distribution: P50, P95, P99 for full runs and individual steps. Agent latency has high variance.
- Cost per run: Track daily, weekly, and monthly. Set alerts for sudden spikes.
- Error rate by type: Which errors are transient (network timeouts) vs systematic (prompt failures)?
- Tool utilization: Which tools are called most? Which fail most? Which are never used? (Unused tools are security surface for no benefit.)
- Token efficiency: Input tokens per run, output tokens per run, cached vs uncached ratio.
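Most of these metrics reduce to simple arithmetic over trace records. A sketch over in-memory run dicts, assuming illustrative field names; a real stack would query these from your trace store:

```python
runs = [
    {"ok": True,  "latency_s": 2.1, "cost_usd": 0.012},
    {"ok": True,  "latency_s": 3.4, "cost_usd": 0.020},
    {"ok": False, "latency_s": 9.8, "cost_usd": 0.031},
    {"ok": True,  "latency_s": 2.7, "cost_usd": 0.015},
]

success_rate = sum(r["ok"] for r in runs) / len(runs)
latencies = sorted(r["latency_s"] for r in runs)

def percentile(values, p):
    # Nearest-rank percentile: crude but good enough for a dashboard sketch.
    k = max(0, min(len(values) - 1, round(p / 100 * len(values)) - 1))
    return values[k]

print(f"success rate: {success_rate:.0%}")
print(f"p50 latency:  {percentile(latencies, 50)}s")
print(f"p99 latency:  {percentile(latencies, 99)}s")   # dominated by the outlier
print(f"total cost:   ${sum(r['cost_usd'] for r in runs):.3f}")
```

Note how the p99 is pulled far from the median by a single slow run: that long tail is exactly the high variance the latency bullet warns about.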
Tools: Datadog and Grafana for infrastructure metrics. Langfuse and Helicone for LLM-specific aggregates. Build dashboards that combine both.
The key question aggregates answer: "Is the system healthy right now, and is it trending better or worse?"
Layer 3: Evaluation and quality scoring
Observability tells you what happened. Evaluation tells you whether what happened was good.
Approaches:
- LLM-as-judge: Use a separate LLM to score the quality of your agent's outputs. Fast, scalable, but not always accurate. Works well for: "Is this response relevant?" Less well for: "Is this response factually correct?"
- Human evaluation: Sample agent outputs and have humans rate them. Gold standard for quality, but expensive and slow. Use for calibration.
- Automated checks: Regex patterns, JSON schema validation, factual consistency checks against source documents. Cheap and deterministic, but limited in scope.
- User feedback: Thumbs up/down, explicit corrections. The most valuable signal, but the hardest to collect consistently.
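The automated-checks tier is the easiest to start with because it needs no model calls. A sketch of a deterministic output gate; the expected keys and the regex pattern are illustrative assumptions about your output contract:

```python
import json
import re

def check_output(raw: str) -> list:
    # Cheap, deterministic checks that can run on every single output.
    failures = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    for key in ("answer", "sources"):  # required keys in this sketch's contract
        if key not in data:
            failures.append(f"missing required key: {key}")
    # Flag boilerplate refusals that suggest the agent gave up.
    if re.search(r"as an AI language model", data.get("answer", ""), re.I):
        failures.append("boilerplate refusal detected")
    return failures

good = '{"answer": "Pro plan is $49/mo", "sources": ["pricing.md"]}'
bad = '{"answer": "As an AI language model, I cannot say"}'
print(check_output(good))  # []
print(check_output(bad))
```

Checks like this catch a surprising share of failures for near-zero cost; reserve LLM-as-judge and human review for the questions regexes cannot answer.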
Tools: Langfuse supports model-based evals and score annotation. Arize Phoenix provides eval templates. Braintrust specializes in eval pipelines.
The key question evals answer: "Is the agent getting better or worse over time?"
Setting up the stack
Start here (week 1)
1. Instrument your agent with Langfuse or Arize Phoenix. Both have Python/JS SDKs that wrap your LLM calls automatically.
2. Add trace IDs to every user-facing response. When a user reports an issue, you need to find the trace.
3. Set up a basic dashboard: success rate, avg latency, daily cost, error count.
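Step 2 is the one teams most often skip. A sketch of the wiring, with `run_agent` as a placeholder for your actual agent and the SDK integration omitted: the point is only that the same ID is generated once, passed into the run, and returned to the user.

```python
import uuid

def run_agent(user_input: str, trace_id: str) -> str:
    # Placeholder for your real agent call. In practice, pass trace_id to
    # your tracing SDK so every LLM and tool call in the run is tagged with it.
    return f"(stub answer for: {user_input})"

def handle_request(user_input: str) -> dict:
    trace_id = uuid.uuid4().hex
    output = run_agent(user_input, trace_id=trace_id)
    # Surface the ID in the response body or headers so a bug report
    # ("my request was req 7f3a...") maps directly to a trace.
    return {"response": output, "trace_id": trace_id}

result = handle_request("What does the Pro plan cost?")
print(result["trace_id"])
```

When the pricing complaint from the opening scenario arrives, this ID turns "search the logs for something plausible" into "open this exact trace."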
Add this next (week 2-4)
4. Implement per-tool error tracking. Know which tools fail, how often, and whether retries succeed.
5. Add cost tracking per user, per workflow type, per model.
6. Set up alerts: cost spike (>2x daily average), error rate spike (>20%), latency P99 breach.
7. Build a simple eval pipeline: sample 10% of runs, score with LLM-as-judge, track the score distribution over time.
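The alert rules in step 6 are simple threshold checks once the aggregates exist. A sketch mirroring the thresholds above (2x daily cost average, 20% error rate, a P99 latency budget); the function name and inputs are illustrative:

```python
def check_alerts(today_cost: float, trailing_avg_cost: float,
                 errors: int, total_runs: int,
                 p99_latency_s: float, p99_budget_s: float) -> list:
    # Threshold checks matching the alert rules in step 6.
    alerts = []
    if trailing_avg_cost > 0 and today_cost > 2 * trailing_avg_cost:
        alerts.append("cost spike: >2x daily average")
    if total_runs > 0 and errors / total_runs > 0.20:
        alerts.append("error rate spike: >20%")
    if p99_latency_s > p99_budget_s:
        alerts.append("latency P99 breach")
    return alerts

print(check_alerts(today_cost=95.0, trailing_avg_cost=40.0,
                   errors=3, total_runs=100,
                   p99_latency_s=12.0, p99_budget_s=15.0))
# only the cost rule fires: 95 > 2 * 40; error rate (3%) and latency are fine
```

In production you would run this on a schedule against your metrics store and route the list to a pager or chat channel.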
Production-grade (month 2+)
8. Implement trace comparison — when an agent produces a bad output, compare its trace to traces of good outputs for similar inputs.
9. Add regression testing — re-run historical inputs periodically and compare outputs. Catch quality drift before users notice.
10. Build feedback loops — user corrections flow back to eval scores, which inform prompt improvements, which are tested against the regression suite.
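The regression suite in step 9 can start as something very small: stored historical inputs, approved baseline outputs, and a comparison. In this sketch the comparison is exact match and `run_agent` is a stub that simulates drift; in practice you would compare with an LLM-as-judge or a similarity score, since agent outputs rarely match verbatim.

```python
# Approved historical input -> output pairs (illustrative).
baselines = {
    "What does the Pro plan cost?": "$49/month",
    "Do you offer a free tier?": "Yes, up to 3 seats",
}

def run_agent(prompt: str) -> str:
    # Stub standing in for the real agent; imagine the pricing answer drifted.
    current = {
        "What does the Pro plan cost?": "$59/month",
        "Do you offer a free tier?": "Yes, up to 3 seats",
    }
    return current[prompt]

regressions = [prompt for prompt, expected in baselines.items()
               if run_agent(prompt) != expected]
print(regressions)  # the drifted pricing answer is flagged before users notice
```

Run this on every prompt or model change, and the quality drift in step 9 becomes a failing check instead of a user complaint.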
Common anti-patterns
Anti-pattern 1: Logging everything, analyzing nothing. 10 million traces in Langfuse mean nothing if nobody looks at them. Set up automated analysis and alerts, not just storage.
Anti-pattern 2: Measuring latency but not quality. A fast, wrong answer is worse than a slow, right answer. Always pair performance metrics with quality metrics.
Anti-pattern 3: Human evaluation as a one-time event. Quality drifts as models update, prompts change, and user behavior evolves. Evaluation must be continuous, not a one-off exercise at launch.
Anti-pattern 4: Separate observability for agents and infrastructure. Your agent's latency spike might be caused by a database slow query, not an LLM issue. Use correlation IDs that span your entire stack.
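One lightweight way to get stack-spanning correlation is a context-local ID that every layer's logs pick up automatically. A sketch using Python's stdlib `contextvars` and `logging`; the ID format and logger name are illustrative:

```python
import contextvars
import logging

# One correlation ID per request, visible to every layer on the call path.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    def filter(self, record):
        # Stamp every log record with the current request's ID.
        record.correlation_id = correlation_id.get()
        return True

logging.basicConfig(format="%(correlation_id)s %(message)s", level=logging.INFO)
log = logging.getLogger("agent")
log.addFilter(CorrelationFilter())

correlation_id.set("req-7f3a")
log.info("llm call started")       # req-7f3a llm call started
log.info("db query took 4200ms")   # same ID ties the slow query to this run
```

With the same ID on the agent trace and the database log line, the "is it the LLM or the database?" question becomes a single filtered search.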
The bottom line
Agent observability isn't optional infrastructure — it's the thing that determines whether you can actually maintain your agent in production. The difference between "we shipped an agent" and "we run an agent reliably" is entirely in the observability stack.
Start with traces. Add aggregates. Layer in evals. That's the order. Check our Observability category for scored tools to build your stack.