#1: Base Model (claude-opus-4-7), Thinking Off. Pass^3 79.4%, Pass@3 87.3%, Tasks 50/63, Date 2026-04-20, Total Cost $60.39, Avg Cost $0.320.
All benchmark submissions are ranked by Pass^3, with Pass@3 shown alongside it for the same three-attempt runs.
The benchmark reports two headline scores.
Pass^3 measures consistency across three runs: a task counts as a pass only if the model passes all three attempts.
Pass@3 measures best-of-three success: a task counts as a pass if the model solves it at least once across the three attempts.
Failed attempts are scored as 0 rather than treated as missing. This includes agent timeouts, non-zero exits, and other run-level exceptions captured in the trial metadata. Each task has a 10-minute timeout per attempt.
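The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scoring code: the outcome labels (`"pass"`, `"timeout"`, `"nonzero_exit"`, `"exception"`), the `trials` mapping, and the function names are all hypothetical, and the sketch assumes every task has exactly three recorded outcomes.

```python
from typing import Dict, List

# Hypothetical outcome labels; anything other than "pass" scores 0,
# which mirrors "failed attempts are scored as 0 rather than missing".
def score(outcome: str) -> int:
    return 1 if outcome == "pass" else 0

def pass_hat_3(trials: Dict[str, List[str]]) -> float:
    """Pass^3: fraction of tasks where all three attempts passed."""
    return sum(all(score(o) for o in outs) for outs in trials.values()) / len(trials)

def pass_at_3(trials: Dict[str, List[str]]) -> float:
    """Pass@3: fraction of tasks where at least one attempt passed."""
    return sum(any(score(o) for o in outs) for outs in trials.values()) / len(trials)

# Illustrative data only (not taken from the leaderboard).
trials = {
    "task-a": ["pass", "pass", "pass"],
    "task-b": ["pass", "timeout", "pass"],            # timeout counts as 0
    "task-c": ["fail", "nonzero_exit", "exception"],  # never solved
    "task-d": ["pass", "pass", "pass"],
}

print(pass_hat_3(trials))  # 0.5  (task-a and task-d pass all three)
print(pass_at_3(trials))   # 0.75 (task-a, task-b, task-d pass at least once)
```

Because a single timeout or crash turns a task into a Pass^3 failure, Pass^3 can never exceed Pass@3 for the same set of runs.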
| # | Agent | Model | Provider | Thinking | Pass^3 | Pass@3 | Tasks | Total Cost | Avg Cost | Date |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Base Model | claude-opus-4-7 | | Off | 79.4% | 87.3% | 50/63 | $60.39 | $0.320 | 2026-04-20 |
| 2 | Base Model | claude-opus-4-7 | | High | 73.0% | 90.5% | 46/63 | $74.34 | $0.393 | 2026-04-21 |
| 3 | Base Model | claude-sonnet-4-6 | | High | 68.3% | 84.1% | 43/63 | $36.82 | $0.195 | 2026-04-21 |
| 4 | Base Model | claude-opus-4-6 | | Off | 66.7% | 90.5% | 42/63 | $53.19 | $0.281 | 2026-04-21 |
| 5 | Base Model | claude-opus-4-7 | | Low | 63.5% | 85.7% | 40/63 | $43.20 | $0.229 | 2026-04-21 |
| 6 | Base Model | gemini-3.1-pro-preview | | High | 63.5% | 82.5% | 40/63 | $34.38 | $0.182 | 2026-04-21 |
| 7 | Base Model | gpt-5.4-2026-03-05 | | High | 61.9% | 84.1% | 39/63 | $28.86 | $0.153 | 2026-04-21 |
| 8 | Base Model | gemini-3-flash-preview | | High | 61.9% | 84.1% | 39/63 | $13.21 | $0.070 | 2026-04-20 |
| 9 | Base Model | gemini-3-flash-preview | | Off | 58.7% | 82.5% | 37/63 | $14.57 | $0.077 | 2026-04-21 |
| 10 | Base Model | claude-sonnet-4-6 | | Off | 58.7% | 79.4% | 37/63 | $26.80 | $0.142 | 2026-04-21 |
| 11 | Base Model | claude-opus-4-6 | | High | 58.7% | 76.2% | 37/63 | $68.91 | $0.365 | 2026-04-21 |
| 12 | Base Model | gemini-3.1-pro-preview | | Low | 57.1% | 76.2% | 36/63 | $20.50 | $0.108 | 2026-04-21 |
| 13 | Base Model | gpt-5.4-2026-03-05 | | Low | 55.6% | 82.5% | 35/63 | $12.49 | $0.066 | 2026-04-21 |
| 14 | Base Model | qwen/qwen3.6-plus | | Off | 55.6% | 79.4% | 35/63 | $15.68 * | $0.083 * | 2026-04-28 |
| 15 | Base Model | gemini-3.1-pro-preview | | Off | 54.0% | 76.2% | 34/63 | $33.83 | $0.179 | 2026-04-20 |
| 16 | Base Model | moonshotai/kimi-k2.5 | | Off | 50.8% | 81.0% | 32/63 | $8.63 * | $0.046 * | 2026-04-28 |
| 17 | Base Model | claude-sonnet-4-6 | | Low | 50.8% | 76.2% | 32/63 | $25.92 | $0.137 | 2026-04-21 |
| 18 | Base Model | claude-opus-4-6 | | Low | 49.2% | 69.8% | 31/63 | $37.61 | $0.199 | 2026-04-21 |
| 19 | Base Model | gpt-5.4-nano | | High | 47.6% | 76.2% | 30/63 | $5.31 | $0.028 | 2026-04-21 |
| 20 | Base Model | claude-haiku-4-5-20251001 | | Off | 47.6% | 58.7% | 30/63 | $21.56 | $0.114 | 2026-04-20 |
| 21 | Base Model | gemini-3-flash-preview | | Low | 46.0% | 84.1% | 29/63 | $7.15 | $0.038 | 2026-04-21 |
| 22 | Base Model | gpt-5.1-codex-mini | | High | 46.0% | 74.6% | 29/63 | $3.18 | $0.017 | 2026-04-20 |
| 23 | Base Model | gpt-5.1-codex-mini | | Off | 46.0% | 71.4% | 29/63 | $2.72 | $0.014 | 2026-04-21 |
| 24 | Base Model | gpt-5.4-2026-03-05 | | Off | 44.4% | 76.2% | 28/63 | $11.89 | $0.063 | 2026-04-21 |
| 25 | Base Model | claude-haiku-4-5-20251001 | | High | 39.7% | 66.7% | 25/63 | $14.04 | $0.074 | 2026-04-21 |
| 26 | Base Model | gemini-3.1-flash-lite-preview | | High | 34.9% | 71.4% | 22/63 | $5.25 | $0.028 | 2026-04-21 |
| 27 | Base Model | claude-haiku-4-5-20251001 | | Low | 34.9% | 61.9% | 22/63 | $14.01 | $0.074 | 2026-04-21 |
| 28 | Base Model | gemini-3.1-flash-lite-preview | | Low | 33.3% | 66.7% | 21/63 | $4.79 | $0.025 | 2026-04-21 |
| 29 | Base Model | gemini-3.1-flash-lite-preview | | Off | 33.3% | 58.7% | 21/63 | $3.61 | $0.019 | 2026-04-21 |
| 30 | Base Model | x-ai/grok-4.20 | | Off | 31.7% | 61.9% | 20/63 | $57.79 | $0.306 | 2026-04-28 |
| 31 | Base Model | gpt-5.4-mini | | High | 28.1% | 78.9% | 16/57 | $7.96 | $0.047 | 2026-04-21 |
| 32 | Base Model | gpt-5.4-mini | | Off | 25.4% | 44.4% | 16/63 | $2.83 | $0.015 | 2026-04-21 |
| 33 | Base Model | gpt-5.4-mini | | Low | 20.6% | 61.9% | 13/63 | $2.89 | $0.015 | 2026-04-21 |
| 34 | Base Model | gpt-5.4-nano | | Low | 19.0% | 50.8% | 12/63 | $1.33 | $0.007 | 2026-04-21 |
| 35 | Base Model | gpt-5.4-nano | | Off | 11.1% | 22.2% | 7/63 | $0.85 | $0.005 | 2026-04-21 |
* Price was computed in OpenRouter and can vary for OSS models.
Open-source models were run via OpenRouter.
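
As a reading aid, the columns appear to be related by simple arithmetic. The sketch below checks two relationships that are consistent with the published figures but are not stated by the source: Pass^3 equals the Tasks fraction, and Avg Cost equals Total Cost divided by the number of attempts (tasks × 3). The helper names are hypothetical.

```python
ATTEMPTS_PER_TASK = 3  # three-attempt runs, per the scoring description

def pass_hat_3_pct(passed: int, total: int) -> float:
    """Tasks fraction expressed as a percentage, rounded to one decimal."""
    return round(100 * passed / total, 1)

def avg_cost_per_attempt(total_cost: float, total_tasks: int) -> float:
    """Assumed definition: total cost spread over every attempt."""
    return round(total_cost / (total_tasks * ATTEMPTS_PER_TASK), 3)

# Row 1: claude-opus-4-7, Thinking Off (50/63, $60.39)
print(pass_hat_3_pct(50, 63))           # 79.4
print(avg_cost_per_attempt(60.39, 63))  # 0.32

# Row 31: gpt-5.4-mini, Thinking High (16/57, $7.96)
print(pass_hat_3_pct(16, 57))           # 28.1
print(avg_cost_per_attempt(7.96, 57))   # 0.047
```

Row 31 is the only entry scored over 57 tasks rather than 63, so its percentages are not directly comparable to the other rows.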