EntropyMath Leaderboard

A high-entropy mathematical reasoning benchmark for LLMs

[Figure: Model Accuracy vs Pass@3. Vertical axis in percent (25% to 100%); series: Accuracy, Pass@3; models: Gemini-3-Pro-Preview, GPT-5.2 (high), Solar Pro 3 (Round 2), Kanana-2-30B-Thinking-2601, K-EXAONE-236B-A23B.]

[Figure: Avg Token Usage (Per Problem). Axis: Avg Tokens / Problem (ticks at 5.2K, 10.3K, 15.5K, 20.7K); models: K-EXAONE-236B-A23B, Solar Pro 3 (Round 2), Gemini-3-Pro-Preview, Kanana-2-30B-Thinking-2601, GPT-5.2 (high).]

EntropyMath is an evolutionary multi-agent system and benchmark that generates high-entropy math problems designed to systematically break current LLMs.

Results are reported using Pass@3 metrics to account for generation variance, alongside detailed execution traces for transparency.

Performance Legend

Mastery (100%): 3/3
Strong (66%): 2/3
Weak (33%): 1/3
Fail (0%): 0/3
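
As a small illustration of how the legend applies to a table cell, the sketch below maps a cell such as 2/3 to its label. The names LEGEND and tier are hypothetical and not part of EntropyMath. The legend only defines labels for three-attempt cells, so the function rejects anything else rather than guessing.

```python
# Labels copied from the Performance Legend; LEGEND and tier are hypothetical names.
LEGEND = {3: "Mastery", 2: "Strong", 1: "Weak", 0: "Fail"}

def tier(cell: str) -> str:
    """Map a table cell such as '2/3' to its legend label."""
    correct, attempts = (int(part) for part in cell.split("/"))
    if attempts != 3:
        # Assumption: cells like '1/1' have no legend label, so refuse them.
        raise ValueError("the legend only defines labels for 3-attempt cells")
    return LEGEND[correct]

print([tier(c) for c in ["3/3", "2/3", "1/3", "0/3"]])
# ['Mastery', 'Strong', 'Weak', 'Fail']
```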

Model	Acc (%)	Pass@3 (%)	0	1	2	3	4	5	6	7	8	9	10	11
API / Others
Gemini-3-Pro-Preview	100.0	100.0	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1
GPT-5.2 (high)	83.3	83.3	1/1	1/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1	0/1	1/1
K-LLM Project Round 2
Solar Pro 3 (Round 2)	75.0	83.3	3/3	3/3	3/3	3/3	0/3	2/3	3/3	0/3	2/3	3/3	2/3	3/3
K-EXAONE-236B-A23B	58.3	75.0	3/3	2/3	3/3	3/3	1/3	0/3	3/3	0/3	3/3	0/3	1/3	2/3
Local - KR
Kanana-2-30B-Thinking-2601	61.1	83.3	3/3	3/3	3/3	0/3	1/3	3/3	3/3	0/3	1/3	2/3	1/3	2/3
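
The Acc and Pass@3 columns can be reproduced from the per-problem cells under two assumptions that the page does not state explicitly: Acc is total correct attempts divided by total attempts, and Pass@3 is the share of problems with at least one correct attempt. The sketch below is a minimal check of that reading. The function names are hypothetical, and the cell strings are copied from the table rows above.

```python
def parse_cells(row: str):
    """Turn a row of cells like '3/3 2/3' into (correct, attempts) pairs."""
    return [tuple(int(x) for x in cell.split("/")) for cell in row.split()]

def accuracy(results) -> float:
    """Assumed definition: all correct attempts over all attempts, in percent."""
    correct = sum(c for c, _ in results)
    attempts = sum(n for _, n in results)
    return round(100 * correct / attempts, 1)

def pass_at_k(results) -> float:
    """Assumed definition: percent of problems solved in at least one attempt."""
    solved = sum(1 for c, _ in results if c > 0)
    return round(100 * solved / len(results), 1)

rows = {
    "Solar Pro 3 (Round 2)": "3/3 3/3 3/3 3/3 0/3 2/3 3/3 0/3 2/3 3/3 2/3 3/3",
    "K-EXAONE-236B-A23B": "3/3 2/3 3/3 3/3 1/3 0/3 3/3 0/3 3/3 0/3 1/3 2/3",
    "Kanana-2-30B-Thinking-2601": "3/3 3/3 3/3 0/3 1/3 3/3 3/3 0/3 1/3 2/3 1/3 2/3",
}

for model, row in rows.items():
    results = parse_cells(row)
    print(model, accuracy(results), pass_at_k(results))
# Solar Pro 3 (Round 2) 75.0 83.3
# K-EXAONE-236B-A23B 58.3 75.0
# Kanana-2-30B-Thinking-2601 61.1 83.3
```

Under these definitions the computed values match the reported columns for all three rows. For rows with a single attempt per problem (the 1/1 cells), the two metrics coincide, which agrees with the identical Acc and Pass@3 values shown for those models.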