Developer building autonomous log analysis agent
“Functional prototype covering 90% of cases with 200+ test suite; enables fast diagnosis of GPU cluster issues and automatic ticket creation for at-risk clusters”
team
Why they built it
To handle complexity of operating large GPU clusters (cooling, power, networking, fans) using natural language queries on petabytes of telemetry data
What worked
Prompt engineering enabled quick prototype; mixture-of-agents with domain experts; supervisor routing; graphs for trend analysis like Slurm job failures
What broke or was painful
Hallucinations on off-topic questions (added off-topic detection agent); single LLM insufficient for diverse telemetry (switched to multi-LLM MoA)
The result
Functional prototype covering 90% of cases with 200+ test suite; enables fast diagnosis of GPU cluster issues and automatic ticket creation for at-risk clusters