Leaderboard

All benchmark submissions are ranked by Pass^3, with Pass@3 shown alongside it for three-attempt runs.

The benchmark reports two headline scores for each model.

Pass^3 measures consistency across three runs: a task counts only if the model passes all three attempts.

Pass@3 measures best-of-three success: a task counts if the model solves it at least once across three attempts.
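As a minimal sketch of the two definitions above, assume each task's three attempts are recorded as a list of booleans. The function names and the `results` layout are illustrative assumptions, not the benchmark's actual code.

```python
from typing import Dict, List


def pass_hat_k(results: Dict[str, List[bool]]) -> float:
    """Pass^k: fraction of tasks where every attempt passed."""
    return sum(all(attempts) for attempts in results.values()) / len(results)


def pass_at_k(results: Dict[str, List[bool]]) -> float:
    """Pass@k: fraction of tasks where at least one attempt passed."""
    return sum(any(attempts) for attempts in results.values()) / len(results)


# Hypothetical outcomes for four tasks, three attempts each.
results = {
    "task-a": [True, True, True],     # counts for both metrics
    "task-b": [True, False, True],    # counts for Pass@3 only
    "task-c": [False, False, False],  # counts for neither
    "task-d": [True, True, True],     # counts for both metrics
}

print(f"Pass^3 = {pass_hat_k(results):.1%}")
print(f"Pass@3 = {pass_at_k(results):.1%}")
```

Because every task that passes all three attempts also passes at least once, Pass^3 can never exceed Pass@3 for the same run.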

Failed attempts, including agent timeouts, non-zero exits, and other run-level exceptions captured in the trial metadata, are scored as 0 rather than treated as missing. Each task has a 10-minute timeout per attempt.
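The scoring rule for failed attempts could look roughly like the sketch below. The `Trial` record and its field names are hypothetical stand-ins for the trial metadata, which is not described here; only the rule itself (every run-level failure scores 0) comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

TIMEOUT_SECONDS = 10 * 60  # 10-minute limit per attempt, as stated above


@dataclass
class Trial:
    """Hypothetical per-attempt record; field names are assumptions."""
    passed: Optional[bool]           # None when the run never produced a verdict
    exit_code: Optional[int] = 0     # None when the process never exited cleanly
    timed_out: bool = False          # attempt exceeded TIMEOUT_SECONDS
    exception: Optional[str] = None  # run-level exception message, if any


def score(trial: Trial) -> int:
    """Return 1 for a clean pass and 0 for any kind of failure."""
    if trial.timed_out or trial.exception is not None:
        return 0  # timeouts and run-level exceptions count as failures
    if trial.exit_code != 0:
        return 0  # non-zero (or missing) exit code counts as a failure
    return 1 if trial.passed else 0


attempts = [
    Trial(passed=True),
    Trial(passed=None, timed_out=True),
    Trial(passed=None, exit_code=1),
]
print([score(t) for t in attempts])
```

Scoring failures as 0 keeps the denominator fixed at three attempts per task, so a run that crashes cannot raise its score by shrinking the set of counted attempts.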

| Rank | Model | Pass^3 | Provider | Thinking | Pass@3 | Tasks | Date | Total Cost | Avg Cost |
|---|---|---|---|---|---|---|---|---|---|
| #1 | gemini-3.1-pro-preview | 82.5% | Google | High | 95.2% | 52/63 | 2026-04-15 | $46.13 | $0.244 |
| #2 | gpt-5.4-2026-03-05 | 82.5% | OpenAI | High | 92.1% | 52/63 | 2026-04-14 | $32.30 | $0.171 |
| #3 | claude-opus-4-6 | 81.0% | Anthropic | High | 92.1% | 51/63 | 2026-04-14 | $82.85 | $0.438 |
| #4 | claude-opus-4-6 | 77.8% | Anthropic | Off | 93.7% | 49/63 | 2026-04-14 | $85.91 | $0.455 |
| #5 | claude-sonnet-4-6 | 77.8% | Anthropic | High | 90.5% | 49/63 | 2026-04-14 | $46.73 | $0.247 |
| #6 | gemini-3.1-pro-preview | 77.8% | Google | Low | 93.7% | 49/63 | 2026-04-14 | $27.51 | $0.146 |
| #7 | claude-opus-4-6 | 74.6% | Anthropic | Low | 85.7% | 47/63 | 2026-04-14 | $41.19 | $0.218 |
| #8 | gemini-3.1-pro-preview | 74.6% | Google | Off | 92.1% | 47/63 | 2026-04-14 | $38.76 | $0.205 |
| #9 | gpt-5.4-2026-03-05 | 73.0% | OpenAI | Low | 90.5% | 46/63 | 2026-04-14 | $13.86 | $0.073 |
| #10 | qwen/qwen3.6-plus | 73.0% | Qwen | Off | 87.3% | 46/63 | 2026-04-15 | $0.00 | $0.000 |
| #11 | claude-sonnet-4-6 | 71.4% | Anthropic | Low | 88.9% | 45/63 | 2026-04-14 | $31.88 | $0.169 |
| #12 | deepseek/deepseek-v3.2 | 71.4% | DeepSeek | Off | 92.1% | 45/63 | 2026-04-15 | $0.00 | $0.000 |
| #13 | claude-sonnet-4-6 | 66.7% | Anthropic | Off | 85.7% | 42/63 | 2026-04-14 | $38.23 | $0.202 |
| #14 | gemini-3.1-flash-lite-preview | 61.9% | Google | High | 85.7% | 39/63 | 2026-04-15 | $5.01 | $0.027 |
| #15 | gpt-5.4-2026-03-05 | 61.9% | OpenAI | Off | 85.7% | 39/63 | 2026-04-14 | $12.20 | $0.065 |
| #16 | gpt-5.4-nano | 61.9% | OpenAI | High | 90.5% | 39/63 | 2026-04-14 | $6.08 | $0.032 |
| #17 | gemini-3.1-flash-lite-preview | 60.3% | Google | Low | 88.9% | 38/63 | 2026-04-15 | $4.66 | $0.025 |
| #18 | moonshotai/kimi-k2.5 | 60.3% | Moonshot AI | Off | 90.5% | 38/63 | 2026-04-14 | $0.00 | $0.000 |
| #19 | claude-haiku-4-5-20251001 | 58.7% | Anthropic | Off | 69.8% | 37/63 | 2026-04-13 | $13.33 | $0.071 |
| #20 | gpt-5.1-codex-mini | 57.1% | OpenAI | High | 82.5% | 36/63 | 2026-04-13 | $4.86 | $0.026 |
| #21 | gpt-5.4-mini | 55.6% | OpenAI | High | 90.5% | 35/63 | 2026-04-14 | $9.65 | $0.051 |
| #22 | claude-haiku-4-5-20251001 | 52.4% | Anthropic | High | 76.2% | 33/63 | 2026-04-13 | $11.27 | $0.060 |
| #23 | gpt-5.1-codex-mini | 50.8% | OpenAI | Off | 87.3% | 32/63 | 2026-04-13 | $3.86 | $0.020 |
| #24 | gemini-3.1-flash-lite-preview | 47.6% | Google | Off | 79.4% | 30/63 | 2026-04-15 | $4.18 | $0.022 |
| #25 | claude-haiku-4-5-20251001 | 46.0% | Anthropic | Low | 69.8% | 29/63 | 2026-04-13 | $11.39 | $0.060 |
| #26 | gpt-5.4-mini | 42.9% | OpenAI | Low | 69.8% | 27/63 | 2026-04-14 | $2.82 | $0.015 |
| #27 | gpt-5.4-mini | 39.7% | OpenAI | Off | 63.5% | 25/63 | 2026-04-14 | $2.80 | $0.015 |
| #28 | gpt-5.4-nano | 38.1% | OpenAI | Low | 66.7% | 24/63 | 2026-04-14 | $1.42 | $0.008 |
| #29 | gpt-5.4-nano | 23.8% | OpenAI | Off | 38.1% | 15/63 | 2026-04-14 | $0.81 | $0.004 |
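The figures in the leaderboard entries can be cross-checked with a little arithmetic. In the sketch below, the Tasks count divided by 63 reproduces the Pass^3 percentage. The Avg Cost values also appear consistent with Total Cost divided by 189 attempts (63 tasks × 3 attempts); that per-attempt reading is an inference from the numbers, not a definition stated by the leaderboard. The helper names are illustrative.

```python
TASKS = 63     # tasks per submission, from the Tasks field
ATTEMPTS = 3   # attempts per task


def pass_rate_pct(tasks_passed: int, total_tasks: int = TASKS) -> float:
    """Percentage of tasks passed, rounded to one decimal place."""
    return round(100 * tasks_passed / total_tasks, 1)


def cost_per_attempt(total_cost: float) -> float:
    """Assumed meaning of Avg Cost: total cost spread over all attempts."""
    return total_cost / (TASKS * ATTEMPTS)


# (model, tasks passed on all three attempts, total cost in USD), top three entries
rows = [
    ("gemini-3.1-pro-preview", 52, 46.13),
    ("gpt-5.4-2026-03-05", 52, 32.30),
    ("claude-opus-4-6", 51, 82.85),
]

for model, passed, total_cost in rows:
    print(f"{model}: {pass_rate_pct(passed)}% Pass^3, "
          f"${cost_per_attempt(total_cost):.3f} per attempt")
```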

Open-source models were run via OpenRouter.