o11y-bench
A standardized evaluation suite measuring how well AI agents perform 63 real-world observability tasks across logs, metrics, traces, dashboards, and incident workflows.
| # | Agent | Model | Provider | Thinking | Pass^3 | Pass@3 | Tasks | Total Cost | Avg Cost | Date |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Base Model | claude-opus-4-7 | | Off | 79.4% | 87.3% | 50/63 | $60.39 | $0.320 | 2026-04-20 |
| 2 | Base Model | claude-opus-4-7 | | High | 73.0% | 90.5% | 46/63 | $74.34 | $0.393 | 2026-04-21 |
| 3 | Base Model | claude-sonnet-4-6 | | High | 68.3% | 84.1% | 43/63 | $36.82 | $0.195 | 2026-04-21 |
| 4 | Base Model | claude-opus-4-6 | | Off | 66.7% | 90.5% | 42/63 | $53.19 | $0.281 | 2026-04-21 |
| 5 | Base Model | claude-opus-4-7 | | Low | 63.5% | 85.7% | 40/63 | $43.20 | $0.229 | 2026-04-21 |
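The leaderboard columns can be cross-checked against each other. The sketch below assumes that Pass^3 equals the passed-task count divided by the 63 tasks, and that Avg Cost is the total cost divided by the number of trials (63 tasks times 3 trials). The second assumption is not stated by the benchmark; it is inferred because it reproduces the listed figures. The names `ROWS` and `derived` are illustrative only.

```python
# Rows copied from the leaderboard table:
# (model, thinking, tasks_passed, total_cost_usd, listed_pass3_pct, listed_avg_cost_usd)
ROWS = [
    ("claude-opus-4-7", "Off", 50, 60.39, 79.4, 0.320),
    ("claude-opus-4-7", "High", 46, 74.34, 73.0, 0.393),
    ("claude-sonnet-4-6", "High", 43, 36.82, 68.3, 0.195),
    ("claude-opus-4-6", "Off", 42, 53.19, 66.7, 0.281),
    ("claude-opus-4-7", "Low", 40, 43.20, 63.5, 0.229),
]

TOTAL_TASKS = 63
TRIALS_PER_TASK = 3  # assumption: every task ran exactly three trials


def derived(tasks_passed, total_cost_usd):
    """Recompute Pass^3 (percent) and average cost per trial (USD)."""
    pass3_pct = round(100 * tasks_passed / TOTAL_TASKS, 1)
    # Assumption: Avg Cost is per trial, not per task.
    avg_cost = round(total_cost_usd / (TOTAL_TASKS * TRIALS_PER_TASK), 3)
    return pass3_pct, avg_cost


for model, thinking, passed, cost, listed_pass3, listed_avg in ROWS:
    pass3_pct, avg_cost = derived(passed, cost)
    print(f"{model} ({thinking}): Pass^3 {pass3_pct}% vs {listed_pass3}%, "
          f"avg ${avg_cost:.3f} vs ${listed_avg:.3f}")
```

Under these assumptions every recomputed value matches the listed one, which suggests that the Tasks column counts tasks that passed all three trials.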
Category scores are Pass^3: the share of tasks in each category that passed all three benchmark trials. Green is 95% or higher, yellow is 80% to 94%, and red is below 80%.
| Model | Dashboards | Grafana API | Investigation | Logs | Metrics | Traces |
|---|---|---|---|---|---|---|
| claude-opus-4-7 | 57% | 100% | 73% | 80% | 88% | 77% |
| claude-opus-4-7 | 43% | 100% | 45% | 80% | 88% | 77% |
| claude-sonnet-4-6 | 29% | 100% | 45% | 50% | 94% | 77% |
| claude-opus-4-6 | 43% | 100% | 45% | 60% | 75% | 77% |
| claude-opus-4-7 | 43% | 100% | 36% | 70% | 69% | 69% |
| gemini-3.1-pro-preview | 43% | 100% | 27% | 50% | 94% | 62% |
| gpt-5.4-2026-03-05 | 29% | 83% | 64% | 40% | 81% | 62% |
| gemini-3-flash-preview | 14% | 100% | 64% | 40% | 81% | 62% |
| gemini-3-flash-preview | 29% | 100% | 36% | 40% | 88% | 54% |
| claude-sonnet-4-6 | 14% | 100% | 55% | 30% | 81% | 62% |
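To make the difference between the two pass metrics concrete, here is a minimal sketch. It assumes the common definitions: Pass^3 counts a task only if all three trials pass, and Pass@3 counts a task if at least one of the three trials passes. The trial outcomes in `TRIALS` are made-up illustrative data, not benchmark results, and the function names are hypothetical.

```python
def pass_hat_k(trials_by_task):
    """Share of tasks whose trials ALL passed (consistency)."""
    passed = sum(1 for trials in trials_by_task.values() if all(trials))
    return passed / len(trials_by_task)


def pass_at_k(trials_by_task):
    """Share of tasks with AT LEAST ONE passing trial (capability)."""
    passed = sum(1 for trials in trials_by_task.values() if any(trials))
    return passed / len(trials_by_task)


# Hypothetical outcomes for four tasks, three trials each (True = pass).
TRIALS = {
    "task-a": [True, True, True],     # counts for both metrics
    "task-b": [True, False, True],    # counts for Pass@3 only
    "task-c": [False, False, False],  # counts for neither
    "task-d": [False, True, False],   # counts for Pass@3 only
}

print(f"Pass^3 = {pass_hat_k(TRIALS):.0%}")  # 25%
print(f"Pass@3 = {pass_at_k(TRIALS):.0%}")   # 75%
```

Since any task that passes all three trials also passes at least one, Pass@3 can never be lower than Pass^3. A large gap between the two, as in several leaderboard rows, points to tasks the model can solve but does not solve reliably.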