o11y-bench

The first observability benchmark for AI agents

A standardized evaluation suite measuring how well AI agents perform 63 real-world observability tasks across logs, metrics, traces, dashboards, and incident workflows.

Top Agents

Rank  Base Model         Provider   Thinking  Pass^3  Pass@3  Tasks  Date        Total Cost  Avg Cost
#1    claude-opus-4-7    Anthropic  Off       79.4%   87.3%   50/63  2026-04-20  $60.39      $0.320
#2    claude-opus-4-7    Anthropic  High      73.0%   90.5%   46/63  2026-04-21  $74.34      $0.393
#3    claude-sonnet-4-6  Anthropic  High      68.3%   84.1%   43/63  2026-04-21  $36.82      $0.195
#4    claude-opus-4-6    Anthropic  Off       66.7%   90.5%   42/63  2026-04-21  $53.19      $0.281
#5    claude-opus-4-7    Anthropic  Low       63.5%   85.7%   40/63  2026-04-21  $43.20      $0.229
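The page does not define Pass^3, Pass@3, or Avg Cost. The sketch below assumes the common reading: a task counts toward Pass^3 only if all three of its trials pass, it counts toward Pass@3 if at least one trial passes, and Avg Cost is total spend divided by the number of trials. The `Trial` record and the `summarize` function are hypothetical names introduced for illustration, not part of o11y-bench.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One benchmark run of one task (hypothetical record layout)."""
    task_id: str
    passed: bool
    cost_usd: float


def summarize(trials, trials_per_task=3):
    """Aggregate per-trial records into leaderboard-style figures.

    Assumed definitions (not confirmed by the page):
      pass^3: share of tasks where every trial passed
      pass@3: share of tasks where at least one trial passed
      avg_cost: total cost divided by the number of trials
    """
    if not trials:
        raise ValueError("no trials to summarize")

    # Group trials by task so each task is scored across all of its runs.
    by_task = {}
    for trial in trials:
        by_task.setdefault(trial.task_id, []).append(trial)

    for task_id, runs in by_task.items():
        if len(runs) != trials_per_task:
            raise ValueError(f"task {task_id!r} has {len(runs)} trials")

    n_tasks = len(by_task)
    all_pass = sum(all(r.passed for r in runs) for runs in by_task.values())
    any_pass = sum(any(r.passed for r in runs) for runs in by_task.values())
    total_cost = sum(t.cost_usd for t in trials)

    return {
        "pass^3": all_pass / n_tasks,
        "pass@3": any_pass / n_tasks,
        "tasks": f"{all_pass}/{n_tasks}",
        "total_cost": total_cost,
        "avg_cost": total_cost / len(trials),
    }
```

The published figures are consistent with this reading: for the top entry, 50/63 rounds to 79.4%, matching its Pass^3 score, and $60.39 spread over 63 × 3 = 189 trials rounds to $0.320, matching its Avg Cost.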

Top 10 By Category

Category scores report Pass^3 consistency across the three benchmark trials run for each task. Green marks scores of 95% or higher, yellow marks scores from 80% up to 95%, and red marks scores below 80%.

Model                   Dashboards  Grafana API  Investigation  Logs  Metrics  Traces
claude-opus-4-7         57%         100%         73%            80%   88%      77%
claude-opus-4-7         43%         100%         45%            80%   88%      77%
claude-sonnet-4-6       29%         100%         45%            50%   94%      77%
claude-opus-4-6         43%         100%         45%            60%   75%      77%
claude-opus-4-7         43%         100%         36%            70%   69%      69%
gemini-3.1-pro-preview  43%         100%         27%            50%   94%      62%
gpt-5.4-2026-03-05      29%         83%          64%            40%   81%      62%
gemini-3-flash-preview  14%         100%         64%            40%   81%      62%
gemini-3-flash-preview  29%         100%         36%            40%   88%      54%
claude-sonnet-4-6       14%         100%         55%            30%   81%      62%
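The category percentages can be cross-checked against the overall Tasks figures. The sketch below makes two assumptions that the page does not state: the per-category task counts (7, 6, 11, 10, 16, 13, inferred from the percentages and summing to 63), and that the first five table rows correspond, in order, to the top five leaderboard entries. The function names are illustrative only.

```python
# Assumed number of tasks per category, inferred from the percentages.
CATEGORY_SIZES = {
    "Dashboards": 7,
    "Grafana API": 6,
    "Investigation": 11,
    "Logs": 10,
    "Metrics": 16,
    "Traces": 13,
}

# (model, category percentages in table order, tasks passed per leaderboard)
# Pairing each row with a leaderboard entry is an assumption based on order.
ROWS = [
    ("claude-opus-4-7", [57, 100, 73, 80, 88, 77], 50),
    ("claude-opus-4-7", [43, 100, 45, 80, 88, 77], 46),
    ("claude-sonnet-4-6", [29, 100, 45, 50, 94, 77], 43),
    ("claude-opus-4-6", [43, 100, 45, 60, 75, 77], 42),
    ("claude-opus-4-7", [43, 100, 36, 70, 69, 69], 40),
]


def passed_tasks(percentages):
    """Convert rounded category percentages back to whole task counts."""
    sizes = CATEGORY_SIZES.values()
    return [round(p * n / 100) for p, n in zip(percentages, sizes)]


def consistent(percentages, expected_total):
    """Check that the counts reproduce the percentages and the overall total."""
    counts = passed_tasks(percentages)
    sizes = list(CATEGORY_SIZES.values())
    # Round half up when converting counts back to whole percentages.
    republished = [int(100 * c / n + 0.5) for c, n in zip(counts, sizes)]
    return republished == list(percentages) and sum(counts) == expected_total


if __name__ == "__main__":
    for model, percentages, total in ROWS:
        print(model, passed_tasks(percentages), consistent(percentages, total))
```

Under these assumptions, each of the five rows sums to the Tasks count published for the matching leaderboard entry, which supports reading the table as sorted by overall rank.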

Featured Tasks
