Leaderboard

All benchmark submissions are ranked by Pass^3, with Pass@3 shown alongside it for the same three-attempt runs.

The benchmark reports two headline scores for each model.

Pass^3 measures consistency across three runs: a task only counts as a pass if the model passes all three attempts.

Pass@3 measures best-of-three success: a task counts as a pass if the model solves it at least once across three attempts.
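
The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own scoring code: the function names `pass_hat_k` and `pass_at_k`, the task IDs, and the dictionary layout (task ID mapped to a list of per-attempt booleans) are all assumptions made for the example.

```python
from typing import Mapping, Sequence


def _validate(results: Mapping[str, Sequence[bool]], k: int) -> None:
    """Require at least one task and exactly k attempts per task."""
    if not results:
        raise ValueError("no tasks to score")
    for task_id, attempts in results.items():
        if len(attempts) != k:
            raise ValueError(f"{task_id}: expected {k} attempts, got {len(attempts)}")


def pass_hat_k(results: Mapping[str, Sequence[bool]], k: int = 3) -> float:
    """Pass^k: share of tasks where every one of the k attempts passed."""
    _validate(results, k)
    return sum(all(attempts) for attempts in results.values()) / len(results)


def pass_at_k(results: Mapping[str, Sequence[bool]], k: int = 3) -> float:
    """Pass@k: share of tasks where at least one of the k attempts passed."""
    _validate(results, k)
    return sum(any(attempts) for attempts in results.values()) / len(results)


# Hypothetical results for four tasks, three attempts each.
example = {
    "task-a": [True, True, True],     # counts for both Pass^3 and Pass@3
    "task-b": [True, False, True],    # counts for Pass@3 only
    "task-c": [False, False, False],  # counts for neither
    "task-d": [True, True, True],     # counts for both
}

print(pass_hat_k(example))  # 0.5
print(pass_at_k(example))   # 0.75
```

Because every task that passes all three attempts also passes at least once, Pass^3 can never exceed Pass@3 for the same run. The Tasks column follows the same arithmetic: 50 of 63 tasks passing all three attempts gives 50/63 ≈ 79.4%.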

Failed attempts are scored as 0 rather than treated as missing, including agent timeouts, non-zero exits, and other run-level exceptions captured in the trial metadata. Each task has a 10-minute timeout per attempt.
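
One way to read the rule above is as a per-attempt scoring function that returns 0 for any run-level failure instead of dropping the attempt. The sketch below is illustrative only: the `Trial` record and its fields (`tests_passed`, `exit_code`, `timed_out`, `exception`) are hypothetical stand-ins for whatever the trial metadata actually contains.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trial:
    """Hypothetical per-attempt record; field names are assumptions."""
    tests_passed: bool               # did the task's checks succeed?
    exit_code: Optional[int]         # None if the agent never reported one
    timed_out: bool                  # True if the 10-minute limit was hit
    exception: Optional[str] = None  # run-level exception name, if any


def score_trial(trial: Trial) -> int:
    """Return 1 for a clean pass, 0 for anything else (never 'missing')."""
    if trial.exception is not None:
        return 0  # run-level exception captured in metadata
    if trial.timed_out:
        return 0  # agent timeout
    if trial.exit_code != 0:
        return 0  # non-zero (or absent) exit code
    return 1 if trial.tests_passed else 0


attempts = [
    Trial(tests_passed=True, exit_code=0, timed_out=False),
    Trial(tests_passed=True, exit_code=0, timed_out=True),
    Trial(tests_passed=False, exit_code=1, timed_out=False),
]
print([score_trial(t) for t in attempts])  # [1, 0, 0]
```

Scoring failures as 0 keeps the denominator fixed at three attempts per task, so a timeout lowers Pass^3 exactly as a wrong answer would.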

| Rank | Provider | Base Model | Thinking | Pass^3 | Pass@3 | Tasks | Date | Total Cost | Avg Cost | Avg Tokens |
|---|---|---|---|---|---|---|---|---|---|---|
| #1 | Anthropic | claude-opus-4-7 | Off | 79.4% | 87.3% | 50/63 | 2026-04-20 | $60.39 | $0.320 | 157k |
| #2 | Anthropic | claude-opus-4-7 | High | 73.0% | 90.5% | 46/63 | 2026-04-21 | $74.34 | $0.393 | 170k |
| #3 | Anthropic | claude-sonnet-4-6 | High | 68.3% | 84.1% | 43/63 | 2026-04-21 | $36.82 | $0.195 | 119k |
| #4 | Anthropic | claude-opus-4-6 | Off | 66.7% | 90.5% | 42/63 | 2026-04-21 | $53.19 | $0.281 | 131k |
| #5 | OpenAI | gpt-5.4-2026-03-05 | High | 64.9% | 87.7% | 37/57 | 2026-04-21 | $24.59 | $0.146 | 119k |
| #6 | Anthropic | claude-opus-4-7 | Low | 63.5% | 85.7% | 40/63 | 2026-04-21 | $43.20 | $0.229 | 128k |
| #7 | Google | gemini-3.1-pro-preview | High | 63.5% | 82.5% | 40/63 | 2026-04-21 | $34.38 | $0.182 | 290k |
| #8 | Google | gemini-3-flash-preview | High | 61.9% | 84.1% | 39/63 | 2026-04-20 | $13.21 | $0.070 | 410k |
| #9 | Google | gemini-3-flash-preview | Off | 58.7% | 82.5% | 37/63 | 2026-04-21 | $14.57 | $0.077 | 418k |
| #10 | Anthropic | claude-sonnet-4-6 | Off | 58.7% | 79.4% | 37/63 | 2026-04-21 | $26.80 | $0.142 | 114k |
| #11 | Anthropic | claude-opus-4-6 | High | 58.7% | 76.2% | 37/63 | 2026-04-21 | $68.91 | $0.365 | 153k |
| #12 | Google | gemini-3.1-pro-preview | Low | 57.1% | 76.2% | 36/63 | 2026-04-21 | $20.50 | $0.108 | 154k |
| #13 | OpenAI | gpt-5.4-2026-03-05 | Low | 55.6% | 82.5% | 35/63 | 2026-04-21 | $12.49 | $0.066 | 75.1k |
| #14 | Qwen | qwen/qwen3.6-plus (Open) | Off | 55.6% | 79.4% | 35/63 | 2026-04-28 | $15.68 * | $0.083 * | 0 |
| #15 | Google | gemini-3.1-pro-preview | Off | 54.0% | 76.2% | 34/63 | 2026-04-20 | $33.83 | $0.179 | 291k |
| #16 | Qwen | qwen3.6-27b@bf16 (Open) | Off | 52.7% | 85.5% | 29/55 | 2026-05-08 | $0.00 * | $0.000 * | 0 |
| #17 | Moonshot AI | moonshotai/kimi-k2.5 (Open) | Off | 50.8% | 81.0% | 32/63 | 2026-04-28 | $8.63 * | $0.046 * | 90.0k |
| #18 | Anthropic | claude-sonnet-4-6 | Low | 50.8% | 76.2% | 32/63 | 2026-04-21 | $25.92 | $0.137 | 108k |
| #19 | Anthropic | claude-opus-4-6 | Low | 49.2% | 69.8% | 31/63 | 2026-04-21 | $37.61 | $0.199 | 117k |
| #20 | OpenAI | gpt-5.4-nano | High | 47.6% | 76.2% | 30/63 | 2026-04-21 | $5.31 | $0.028 | 283k |
| #21 | Anthropic | claude-haiku-4-5-20251001 | Off | 47.6% | 58.7% | 30/63 | 2026-04-20 | $21.56 | $0.114 | 199k |
| #22 | Google | gemini-3-flash-preview | Low | 46.0% | 84.1% | 29/63 | 2026-04-21 | $7.15 | $0.038 | 195k |
| #23 | OpenAI | gpt-5.1-codex-mini | High | 46.0% | 74.6% | 29/63 | 2026-04-20 | $3.18 | $0.017 | 114k |
| #24 | OpenAI | gpt-5.1-codex-mini | Off | 46.0% | 71.4% | 29/63 | 2026-04-21 | $2.72 | $0.014 | 113k |
| #25 | OpenAI | gpt-5.4-2026-03-05 | Off | 44.4% | 76.2% | 28/63 | 2026-04-21 | $11.89 | $0.063 | 79.5k |
| #26 | Anthropic | claude-haiku-4-5-20251001 | High | 39.7% | 66.7% | 25/63 | 2026-04-21 | $14.04 | $0.074 | 151k |
| #27 | Google | gemma-4-31b-it@q8_k_xl (Open) | Off | 38.1% | 71.4% | 24/63 | 2026-05-09 | $0.00 * | $0.000 * | 0 |
| #28 | MiniMax | minimax-m2.7 (Open) | High | 36.5% | 73.0% | 23/63 | 2026-04-25 | $0.00 * | $0.000 * | 0 |
| #29 | Google | gemini-3.1-flash-lite-preview | High | 34.9% | 71.4% | 22/63 | 2026-04-21 | $5.25 | $0.028 | 239k |
| #30 | Anthropic | claude-haiku-4-5-20251001 | Low | 34.9% | 61.9% | 22/63 | 2026-04-21 | $14.01 | $0.074 | 146k |
| #31 | Google | gemini-3.1-flash-lite-preview | Low | 33.3% | 66.7% | 21/63 | 2026-04-21 | $4.79 | $0.025 | 224k |
| #32 | Google | gemini-3.1-flash-lite-preview | Off | 33.3% | 58.7% | 21/63 | 2026-04-21 | $3.61 | $0.019 | 199k |
| #33 | xAI | x-ai/grok-4.20 | Off | 31.7% | 61.9% | 20/63 | 2026-04-28 | $57.79 | $0.306 | 0 |
| #34 | OpenAI | gpt-5.4-mini | High | 28.6% | 76.2% | 18/63 | 2026-04-21 | $9.02 | $0.048 | 112k |
| #35 | Qwen | qwen3.5-9b (Open) | Off | 25.4% | 61.9% | 16/63 | 2026-05-08 | $0.00 * | $0.000 * | 0 |
| #36 | OpenAI | gpt-5.4-mini | Off | 25.4% | 44.4% | 16/63 | 2026-04-21 | $2.83 | $0.015 | 73.0k |
| #37 | OpenAI | gpt-5.4-mini | Low | 20.6% | 61.9% | 13/63 | 2026-04-21 | $2.89 | $0.015 | 61.3k |
| #38 | OpenAI | gpt-5.4-nano | Low | 19.0% | 50.8% | 12/63 | 2026-04-21 | $1.33 | $0.007 | 111k |
| #39 | OpenAI | gpt-5.4-nano | Off | 11.1% | 22.2% | 7/63 | 2026-04-21 | $0.85 | $0.005 | 79.9k |

* Price can vary for open-weight models; self-hosted runs may report $0 when local serving cost is not captured.

Open-weight models were run via OpenRouter; some were also run on self-hosted infrastructure.