#1: gemini-3.1-pro-preview (Thinking: High). Pass^3: 82.5%, Pass@3: 95.2%, Tasks: 52/63, Date: 2026-04-15, Total Cost: $46.13, Avg Cost: $0.244.
All benchmark submissions are ranked by Pass^3, with Pass@3 shown alongside it for three-attempt runs.
The benchmark evaluates model performance with two headline scores.
Pass^3 measures consistency across three runs: a task only counts if the model passes all three attempts.
Pass@3 measures best-of-three success: a task counts if the model solves it at least once across three attempts.
Failed attempts are scored as 0 rather than treated as missing, including agent timeouts, non-zero exits, and other run-level exceptions captured in the trial metadata. Each task has a 10-minute timeout per attempt.
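As a minimal sketch of the two definitions above, the snippet below computes both scores from per-task attempt outcomes. The task names, the `results` dictionary, and the helper names are hypothetical and not part of the benchmark's tooling. An outcome of `None` stands in for an attempt that produced no verdict (for example a timeout or crash) and is counted as a failure, in line with the scoring rule described above.

```python
from typing import Dict, List, Optional

ATTEMPTS = 3  # three attempts per task, as described above

# Hypothetical outcomes: True = pass, False = fail,
# None = no verdict (timeout, non-zero exit, run-level exception).
results: Dict[str, List[Optional[bool]]] = {
    "task_a": [True, True, True],
    "task_b": [True, None, True],
    "task_c": [False, False, None],
    "task_d": [True, True, False],
}

def normalize(attempts: List[Optional[bool]]) -> List[bool]:
    """Score missing or errored attempts as failures and pad to ATTEMPTS."""
    scored = [a is True for a in attempts]
    return (scored + [False] * ATTEMPTS)[:ATTEMPTS]

def pass_hat_k(res: Dict[str, List[Optional[bool]]]) -> float:
    """Pass^3: fraction of tasks where every attempt passed."""
    return sum(all(normalize(a)) for a in res.values()) / len(res)

def pass_at_k(res: Dict[str, List[Optional[bool]]]) -> float:
    """Pass@3: fraction of tasks where at least one attempt passed."""
    return sum(any(normalize(a)) for a in res.values()) / len(res)

print(f"Pass^3 = {pass_hat_k(results):.1%}")
print(f"Pass@3 = {pass_at_k(results):.1%}")
```

Because every task that passes all three attempts also passes at least once, Pass^3 can never exceed Pass@3 for the same run.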
| # | Model | Provider | Thinking | Pass^3 | Pass@3 | Tasks | Total Cost | Avg Cost | Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | gemini-3.1-pro-preview | | High | 82.5% | 95.2% | 52/63 | $46.13 | $0.244 | 2026-04-15 |
| 2 | gpt-5.4-2026-03-05 | | High | 82.5% | 92.1% | 52/63 | $32.30 | $0.171 | 2026-04-14 |
| 3 | claude-opus-4-6 | | High | 81.0% | 92.1% | 51/63 | $82.85 | $0.438 | 2026-04-14 |
| 4 | claude-opus-4-6 | | Off | 77.8% | 93.7% | 49/63 | $85.91 | $0.455 | 2026-04-14 |
| 5 | claude-sonnet-4-6 | | High | 77.8% | 90.5% | 49/63 | $46.73 | $0.247 | 2026-04-14 |
| 6 | gemini-3.1-pro-preview | | Low | 77.8% | 93.7% | 49/63 | $27.51 | $0.146 | 2026-04-14 |
| 7 | claude-opus-4-6 | | Low | 74.6% | 85.7% | 47/63 | $41.19 | $0.218 | 2026-04-14 |
| 8 | gemini-3.1-pro-preview | | Off | 74.6% | 92.1% | 47/63 | $38.76 | $0.205 | 2026-04-14 |
| 9 | gpt-5.4-2026-03-05 | | Low | 73.0% | 90.5% | 46/63 | $13.86 | $0.073 | 2026-04-14 |
| 10 | qwen/qwen3.6-plus | | Off | 73.0% | 87.3% | 46/63 | $0.00 | $0.000 | 2026-04-15 |
| 11 | claude-sonnet-4-6 | | Low | 71.4% | 88.9% | 45/63 | $31.88 | $0.169 | 2026-04-14 |
| 12 | deepseek/deepseek-v3.2 | | Off | 71.4% | 92.1% | 45/63 | $0.00 | $0.000 | 2026-04-15 |
| 13 | claude-sonnet-4-6 | | Off | 66.7% | 85.7% | 42/63 | $38.23 | $0.202 | 2026-04-14 |
| 14 | gemini-3.1-flash-lite-preview | | High | 61.9% | 85.7% | 39/63 | $5.01 | $0.027 | 2026-04-15 |
| 15 | gpt-5.4-2026-03-05 | | Off | 61.9% | 85.7% | 39/63 | $12.20 | $0.065 | 2026-04-14 |
| 16 | gpt-5.4-nano | | High | 61.9% | 90.5% | 39/63 | $6.08 | $0.032 | 2026-04-14 |
| 17 | gemini-3.1-flash-lite-preview | | Low | 60.3% | 88.9% | 38/63 | $4.66 | $0.025 | 2026-04-15 |
| 18 | moonshotai/kimi-k2.5 | | Off | 60.3% | 90.5% | 38/63 | $0.00 | $0.000 | 2026-04-14 |
| 19 | claude-haiku-4-5-20251001 | | Off | 58.7% | 69.8% | 37/63 | $13.33 | $0.071 | 2026-04-13 |
| 20 | gpt-5.1-codex-mini | | High | 57.1% | 82.5% | 36/63 | $4.86 | $0.026 | 2026-04-13 |
| 21 | gpt-5.4-mini | | High | 55.6% | 90.5% | 35/63 | $9.65 | $0.051 | 2026-04-14 |
| 22 | claude-haiku-4-5-20251001 | | High | 52.4% | 76.2% | 33/63 | $11.27 | $0.060 | 2026-04-13 |
| 23 | gpt-5.1-codex-mini | | Off | 50.8% | 87.3% | 32/63 | $3.86 | $0.020 | 2026-04-13 |
| 24 | gemini-3.1-flash-lite-preview | | Off | 47.6% | 79.4% | 30/63 | $4.18 | $0.022 | 2026-04-15 |
| 25 | claude-haiku-4-5-20251001 | | Low | 46.0% | 69.8% | 29/63 | $11.39 | $0.060 | 2026-04-13 |
| 26 | gpt-5.4-mini | | Low | 42.9% | 69.8% | 27/63 | $2.82 | $0.015 | 2026-04-14 |
| 27 | gpt-5.4-mini | | Off | 39.7% | 63.5% | 25/63 | $2.80 | $0.015 | 2026-04-14 |
| 28 | gpt-5.4-nano | | Low | 38.1% | 66.7% | 24/63 | $1.42 | $0.008 | 2026-04-14 |
| 29 | gpt-5.4-nano | | Off | 23.8% | 38.1% | 15/63 | $0.81 | $0.004 | 2026-04-14 |
Open-source models were run via OpenRouter.
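The table's columns appear to be related by simple arithmetic, which the sketch below checks on the first three rows. Two points here are inferences from the published numbers, not statements from the benchmark authors: that the Tasks count is the number of tasks passing all three attempts (so Tasks / 63 reproduces Pass^3), and that Avg Cost is Total Cost divided by 63 × 3 = 189 attempts. The `rows` list and helper names are illustrative.

```python
TASKS = 63     # tasks in the benchmark, from the "x/63" column
ATTEMPTS = 3   # attempts per task

# (model, thinking, tasks_passed, total_cost_usd, avg_cost_usd, pass_hat_3_pct)
# Values copied from the first three table rows.
rows = [
    ("gemini-3.1-pro-preview", "High", 52, 46.13, 0.244, 82.5),
    ("gpt-5.4-2026-03-05", "High", 52, 32.30, 0.171, 82.5),
    ("claude-opus-4-6", "High", 51, 82.85, 0.438, 81.0),
]

def pass_hat_pct(tasks_passed: int) -> float:
    """Pass^3 as a percentage, assuming Tasks counts all-three-pass tasks."""
    return round(100 * tasks_passed / TASKS, 1)

def cost_per_attempt(total_cost: float) -> float:
    """Average cost, assuming it is total cost spread over every attempt."""
    return round(total_cost / (TASKS * ATTEMPTS), 3)

for model, thinking, passed, total, avg, pct in rows:
    ok_pct = abs(pass_hat_pct(passed) - pct) < 1e-9
    ok_avg = abs(cost_per_attempt(total) - avg) < 1e-9
    print(f"{model} ({thinking}): pass^3 matches={ok_pct}, avg cost matches={ok_avg}")
```

If these assumptions hold, the Avg Cost column is a per-attempt figure and not a per-task one.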