Leaderboard

All benchmark submissions are ranked by Pass^3, with Pass@3 shown alongside it for the same three-attempt runs.

The benchmark reports two headline scores for each submission.

Pass^3 measures consistency across three runs: a task only counts as a pass if the model passes all three attempts.

Pass@3 measures best-of-three success: a task counts as a pass if the model solves it at least once across three attempts.
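As a minimal sketch of how the two scores differ, assume each task's three attempts are recorded as booleans. The `results` layout, the task names, and the function names below are hypothetical illustrations, not the benchmark's actual result format.

```python
from typing import Dict, List


def pass_hat_k(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks where every attempt passed (Pass^k)."""
    return sum(all(attempts) for attempts in results.values()) / len(results)


def pass_at_k(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks where at least one attempt passed (Pass@k)."""
    return sum(any(attempts) for attempts in results.values()) / len(results)


# Hypothetical outcomes for four tasks, three attempts each.
results = {
    "task-a": [True, True, True],     # counts toward both scores
    "task-b": [True, False, True],    # counts toward Pass@3 only
    "task-c": [False, False, True],   # counts toward Pass@3 only
    "task-d": [False, False, False],  # counts toward neither
}

print(f"Pass^3 = {pass_hat_k(results):.1%}")  # Pass^3 = 25.0%
print(f"Pass@3 = {pass_at_k(results):.1%}")   # Pass@3 = 75.0%
```

On the same set of runs, Pass^3 can never exceed Pass@3, so the gap between the two gives a rough sense of how inconsistent a model is from one attempt to the next.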

Failed attempts are scored as 0 rather than treated as missing, including agent timeouts, non-zero exits, and other run-level exceptions captured in the trial metadata. Each task has a 10-minute timeout per attempt.
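The failure rule above can be sketched as follows. The `Trial` record and its field names are hypothetical stand-ins for the trial metadata, whose real schema is not described here. Only the rule itself (any failure mode scores 0 and the attempt still counts) and the 10-minute limit come from the text.

```python
from dataclasses import dataclass
from typing import Optional

TIMEOUT_SECONDS = 10 * 60  # per-attempt limit stated in the text


@dataclass
class Trial:
    # Hypothetical fields; the real trial metadata schema may differ.
    tests_passed: bool
    exit_code: int = 0
    duration_seconds: float = 0.0
    exception: Optional[str] = None


def score(trial: Trial) -> int:
    """Return 1 for a clean pass and 0 for any failure; never 'missing'."""
    if trial.exception is not None:
        return 0  # run-level exception recorded for this attempt
    if trial.exit_code != 0:
        return 0  # agent exited with a non-zero status
    if trial.duration_seconds > TIMEOUT_SECONDS:
        return 0  # attempt ran past the timeout
    return 1 if trial.tests_passed else 0


attempts = [
    Trial(tests_passed=True, duration_seconds=212.0),
    Trial(tests_passed=False, exit_code=1, duration_seconds=45.0),
    Trial(tests_passed=False, duration_seconds=600.5, exception="AgentTimeoutError"),
]
scores = [score(t) for t in attempts]
print(scores)  # [1, 0, 0]
```

Because every attempt yields a 0 or a 1, all three attempts stay in the denominator, and a single timeout is enough to make a task fail under Pass^3.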

| Rank | Base Model | Provider | Thinking | Pass^3 | Pass@3 | Tasks | Date | Total Cost | Avg Cost |
|---|---|---|---|---|---|---|---|---|---|
| #1 | claude-opus-4-7 | Anthropic | Off | 79.4% | 87.3% | 50/63 | 2026-04-20 | $60.39 | $0.320 |
| #2 | claude-opus-4-7 | Anthropic | High | 73.0% | 90.5% | 46/63 | 2026-04-21 | $74.34 | $0.393 |
| #3 | claude-sonnet-4-6 | Anthropic | High | 68.3% | 84.1% | 43/63 | 2026-04-21 | $36.82 | $0.195 |
| #4 | claude-opus-4-6 | Anthropic | Off | 66.7% | 90.5% | 42/63 | 2026-04-21 | $53.19 | $0.281 |
| #5 | claude-opus-4-7 | Anthropic | Low | 63.5% | 85.7% | 40/63 | 2026-04-21 | $43.20 | $0.229 |
| #6 | gemini-3.1-pro-preview | Google | High | 63.5% | 82.5% | 40/63 | 2026-04-21 | $34.38 | $0.182 |
| #7 | gpt-5.4-2026-03-05 | OpenAI | High | 61.9% | 84.1% | 39/63 | 2026-04-21 | $28.86 | $0.153 |
| #8 | gemini-3-flash-preview | Google | High | 61.9% | 84.1% | 39/63 | 2026-04-20 | $13.21 | $0.070 |
| #9 | gemini-3-flash-preview | Google | Off | 58.7% | 82.5% | 37/63 | 2026-04-21 | $14.57 | $0.077 |
| #10 | claude-sonnet-4-6 | Anthropic | Off | 58.7% | 79.4% | 37/63 | 2026-04-21 | $26.80 | $0.142 |
| #11 | claude-opus-4-6 | Anthropic | High | 58.7% | 76.2% | 37/63 | 2026-04-21 | $68.91 | $0.365 |
| #12 | gemini-3.1-pro-preview | Google | Low | 57.1% | 76.2% | 36/63 | 2026-04-21 | $20.50 | $0.108 |
| #13 | gpt-5.4-2026-03-05 | OpenAI | Low | 55.6% | 82.5% | 35/63 | 2026-04-21 | $12.49 | $0.066 |
| #14 | qwen/qwen3.6-plus | Qwen | Off | 55.6% | 79.4% | 35/63 | 2026-04-28 | $15.68 * | $0.083 * |
| #15 | gemini-3.1-pro-preview | Google | Off | 54.0% | 76.2% | 34/63 | 2026-04-20 | $33.83 | $0.179 |
| #16 | moonshotai/kimi-k2.5 | Moonshot AI | Off | 50.8% | 81.0% | 32/63 | 2026-04-28 | $8.63 * | $0.046 * |
| #17 | claude-sonnet-4-6 | Anthropic | Low | 50.8% | 76.2% | 32/63 | 2026-04-21 | $25.92 | $0.137 |
| #18 | claude-opus-4-6 | Anthropic | Low | 49.2% | 69.8% | 31/63 | 2026-04-21 | $37.61 | $0.199 |
| #19 | gpt-5.4-nano | OpenAI | High | 47.6% | 76.2% | 30/63 | 2026-04-21 | $5.31 | $0.028 |
| #20 | claude-haiku-4-5-20251001 | Anthropic | Off | 47.6% | 58.7% | 30/63 | 2026-04-20 | $21.56 | $0.114 |
| #21 | gemini-3-flash-preview | Google | Low | 46.0% | 84.1% | 29/63 | 2026-04-21 | $7.15 | $0.038 |
| #22 | gpt-5.1-codex-mini | OpenAI | High | 46.0% | 74.6% | 29/63 | 2026-04-20 | $3.18 | $0.017 |
| #23 | gpt-5.1-codex-mini | OpenAI | Off | 46.0% | 71.4% | 29/63 | 2026-04-21 | $2.72 | $0.014 |
| #24 | gpt-5.4-2026-03-05 | OpenAI | Off | 44.4% | 76.2% | 28/63 | 2026-04-21 | $11.89 | $0.063 |
| #25 | claude-haiku-4-5-20251001 | Anthropic | High | 39.7% | 66.7% | 25/63 | 2026-04-21 | $14.04 | $0.074 |
| #26 | gemini-3.1-flash-lite-preview | Google | High | 34.9% | 71.4% | 22/63 | 2026-04-21 | $5.25 | $0.028 |
| #27 | claude-haiku-4-5-20251001 | Anthropic | Low | 34.9% | 61.9% | 22/63 | 2026-04-21 | $14.01 | $0.074 |
| #28 | gemini-3.1-flash-lite-preview | Google | Low | 33.3% | 66.7% | 21/63 | 2026-04-21 | $4.79 | $0.025 |
| #29 | gemini-3.1-flash-lite-preview | Google | Off | 33.3% | 58.7% | 21/63 | 2026-04-21 | $3.61 | $0.019 |
| #30 | x-ai/grok-4.20 | xAI | Off | 31.7% | 61.9% | 20/63 | 2026-04-28 | $57.79 | $0.306 |
| #31 | gpt-5.4-mini | OpenAI | High | 28.1% | 78.9% | 16/57 | 2026-04-21 | $7.96 | $0.047 |
| #32 | gpt-5.4-mini | OpenAI | Off | 25.4% | 44.4% | 16/63 | 2026-04-21 | $2.83 | $0.015 |
| #33 | gpt-5.4-mini | OpenAI | Low | 20.6% | 61.9% | 13/63 | 2026-04-21 | $2.89 | $0.015 |
| #34 | gpt-5.4-nano | OpenAI | Low | 19.0% | 50.8% | 12/63 | 2026-04-21 | $1.33 | $0.007 |
| #35 | gpt-5.4-nano | OpenAI | Off | 11.1% | 22.2% | 7/63 | 2026-04-21 | $0.85 | $0.005 |

* Price was computed in OpenRouter and can vary for open-source models.

Open-source models were run via OpenRouter.