EntropyMath Leaderboard

A high-entropy mathematical reasoning benchmark for LLMs

Korean CSAT (College Scholastic Ability Test) Math Problems

Figure: Model Accuracy vs Pass@3 (series: Accuracy, Pass@3; percentage axis with ticks from 25% to 100%). Models in chart order: Gemini-3-Pro-Preview, GPT-5.2 (high), Claude-Opus-4.5, Grok-4.1-fast, GPT-5.1 (high), Deepseek-V3.2, Solar-Open-100B, K-EXAONE-236B-A23B, Kanana-2-30B-Thinking-2601, Solar-Pro-2 (31B)(high), Kanana-2-30B-Thinking, HCX-007(high), EXAONE-4.0.1-32B (high), A.X-4.0 (72B), axk1, Llama-VARCO-8B-Instruct.

Figure: Avg Token Usage (Per Problem) (axis: Avg Tokens / Problem, ticks from 28K to 112K). Models in chart order: K-EXAONE-236B-A23B, Grok-4.1-fast, Solar-Open-100B, Kanana-2-30B-Thinking-2601, Kanana-2-30B-Thinking, Deepseek-V3.2, Gemini-3-Pro-Preview, Claude-Opus-4.5, Solar-Pro-2 (31B)(high), GPT-5.1 (high), EXAONE-4.0.1-32B (high), GPT-5.2 (high), Llama-VARCO-8B-Instruct, HCX-007(high), A.X-4.0 (72B), axk1.

EntropyMath is an evolutionary multi-agent system and benchmark that generates high-entropy math problems designed to systematically break current LLMs. The EntropyMath_SAT_50 dataset contains 50 SAT-style problems derived from Korean, Indian, and Japanese college entrance examinations, including the Korean College Scholastic Ability Test (CSAT). The problems demand high-precision calculation, deep conceptual understanding, and logical inference, and they remain a significant challenge even for advanced LLMs.

Results are reported using Pass@3 metrics to account for generation variance, alongside detailed execution traces for transparency.
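The page does not define how ACC and PASS@3 are computed from the per-problem cells in the results table. The following is a minimal sketch under one assumption: ACC is the share of all attempts that were correct, and PASS@3 is the share of problems solved in at least one of the recorded runs. This reading reproduces the three-run K-EXAONE-236B-A23B row (107 of 150 attempts correct, 44 of 50 problems solved at least once, giving 71.3 and 88.0), but it is not confirmed by the source. The function names `parse_cell`, `accuracy`, and `pass_at_k` are hypothetical.

```python
def parse_cell(cell: str) -> tuple[int, int]:
    """Split a 'correct/attempts' cell such as '2/3' into two integers."""
    correct, attempts = cell.strip().split("/")
    return int(correct), int(attempts)


def accuracy(cells: list[str]) -> float:
    """Assumed ACC: percentage of all attempts that were correct."""
    parsed = [parse_cell(c) for c in cells]
    total_correct = sum(correct for correct, _ in parsed)
    total_attempts = sum(attempts for _, attempts in parsed)
    return 100.0 * total_correct / total_attempts


def pass_at_k(cells: list[str]) -> float:
    """Assumed PASS@k: percentage of problems with at least one correct attempt."""
    parsed = [parse_cell(c) for c in cells]
    solved = sum(1 for correct, _ in parsed if correct > 0)
    return 100.0 * solved / len(parsed)


# Illustrative cells (not taken from the table): four problems, three runs each.
example = ["3/3", "1/3", "0/3", "2/3"]
print(accuracy(example))   # 50.0  (6 of 12 attempts correct)
print(pass_at_k(example))  # 75.0  (3 of 4 problems solved at least once)
```

Under this reading, a row with a single run per problem (cells of the form 1/1 or 0/1) yields identical ACC and PASS@3 values, which matches most rows in the table.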

Performance Legend

Mastery (100%): 3/3
Strong (66%): 2/3
Weak (33%): 1/3
Fail (0%): 0/3
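The legend can be read as a lookup from the number of correct runs out of three to a label. A small sketch of that mapping follows. The function name `legend_label` is hypothetical, and because the legend only lists three-run cells, the sketch rejects cells with any other attempt count rather than guessing how they are labeled.

```python
# Labels from the Performance Legend, keyed by correct runs out of three.
LEGEND = {3: "Mastery", 2: "Strong", 1: "Weak", 0: "Fail"}


def legend_label(cell: str) -> str:
    """Return the legend label for a three-run cell such as '2/3'."""
    correct_text, attempts_text = cell.strip().split("/")
    correct, attempts = int(correct_text), int(attempts_text)
    if attempts != 3:
        # The legend does not cover 1/1 or 2/2 style cells; this is a sketch limit.
        raise ValueError(f"legend only defined for three-run cells, got {cell!r}")
    return LEGEND[correct]


print(legend_label("2/3"))  # Strong
```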

Model	ACC	PASS@3	18-22					PS-26-30					Cal-26-30					Geo-26-30					KOR										IND										JPN
Model	ACC	PASS@3	18	19	20	21	22	PS-26	PS-27	PS-28	PS-29	PS-30	Cal-26	Cal-27	Cal-28	Cal-29	Cal-30	Geo-26	Geo-27	Geo-28	Geo-29	Geo-30	Chung-Ang Univ	Dongguk Univ	Ewha Womans Univ	Hanyang Univ	Konkuk Univ	Korea Univ	Kyung Hee Univ	Sungkyunkwan Univ	Sogang Univ	Yonsei Univ	IND-1	IND-2	IND-3	IND-4	IND-5	IND-6	IND-7	IND-8	IND-9	IND-10	JPN-1	JPN-2	JPN-3	JPN-4	JPN-5	JPN-6	JPN-7	JPN-8	JPN-9	JPN-10
API / Others
Gemini-3-Pro-Preview	96.0	96.0	1/1	1/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1
GPT-5.2 (high)	86.0	86.0	1/1	1/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1	0/1	1/1	0/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1
Claude-Opus-4.5	86.0	86.0	1/1	1/1	1/1	0/1	1/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	0/1	0/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	0/1
Grok-4.1-fast	82.0	82.0	1/1	1/1	1/1	0/1	1/1	1/1	1/1	0/1	1/1	0/1	1/1	1/1	1/1	0/1	0/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1
GPT-5.1 (high)	82.0	82.0	1/1	1/1	1/1	0/1	1/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1	0/1	1/1	0/1	1/1	1/1	0/1	1/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	1/1
Deepseek-V3.2	78.0	78.0	1/1	1/1	1/1	0/1	1/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	0/1	0/1	1/1	0/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	0/1	0/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1
K-LLM Project Round 2
K-EXAONE-236B-A23B	71.3	88.0	3/3	3/3	1/3	3/3	2/3	3/3	1/3	0/3	3/3	0/3	3/3	3/3	3/3	0/3	1/3	3/3	1/3	2/3	1/3	1/3	2/3	2/3	2/3	3/3	0/3	3/3	3/3	3/3	2/3	1/3	3/3	0/3	3/3	3/3	3/3	3/3	3/3	2/3	3/3	0/3	3/3	3/3	3/3	3/3	2/3	3/3	3/3	2/3	3/3	1/3
Solar-Open-100B	74.0	74.0	1/1	1/1	1/1	0/1	0/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	0/1	0/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	0/1	1/1	0/1	1/1	0/1
K-EXAONE-236B-A23B	70.0	70.0	1/1	1/1	1/1	0/1	0/1	1/1	1/1	0/1	1/1	1/1	0/1	1/1	0/1	0/1	0/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	0/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	0/1
K-LLM Project Round 1
Solar-Pro-2 (31B)(high)	60.0	60.0	1/1	1/1	1/1	0/1	1/1	1/1	1/1	0/1	0/1	0/1	1/1	1/1	1/1	0/1	0/1	1/1	0/1	1/1	0/1	1/1	0/1	0/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1	0/1	1/1	1/1	0/1	1/1	1/1	0/1	1/1	1/1	1/1	0/1	0/1	0/1	1/1	0/1	1/1	0/1
HCX-007(high)	26.0	26.0	1/1	1/1	0/1	0/1	0/1	1/1	0/1	0/1	1/1	0/1	0/1	0/1	0/1	0/1	0/1	1/1	0/1	1/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	1/1	0/1	0/1	0/1	1/1	1/1	0/1	1/1	1/1	0/1	0/1	0/1	1/1	0/1	1/1	0/1	0/1	0/1	0/1	0/1
EXAONE-4.0.1-32B (high)	24.0	24.0	1/1	0/1	0/1	0/1	0/1	1/1	0/1	0/1	1/1	0/1	1/1	1/1	0/1	0/1	0/1	0/1	0/1	1/1	0/1	1/1	0/1	0/1	0/1	0/1	0/1	1/1	0/1	0/1	0/1	0/1	1/1	0/1	0/1	0/1	0/1	1/1	0/1	1/1	0/1	0/1	0/1	0/1	1/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1
A.X-4.0 (72B)	24.0	24.0	1/1	1/1	0/1	0/1	0/1	1/1	1/1	0/1	1/1	0/1	0/1	0/1	0/1	0/1	0/1	1/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	1/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	1/1	0/1	0/1	1/1	0/1	0/1	0/1	1/1	0/1	0/1	0/1	0/1	1/1	0/1	1/1	0/1
Llama-VARCO-8B-Instruct	2.0	2.0	1/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1	0/1
Local - KR
Kanana-2-30B-Thinking-2601	65.0	80.0	2/2	2/2	1/2	1/2	2/2	2/2	2/2	0/2	2/2	1/2	2/2	2/2	2/2	0/2	0/2	1/2	0/2	1/2	0/2	1/2	1/2	2/2	2/2	1/2	0/2	2/2	0/2	2/2	2/2	0/2	2/2	1/2	2/2	2/2	2/2	1/2	1/2	2/2	2/2	0/2	2/2	2/2	2/2	1/2	1/2	0/2	2/2	2/2	1/2	1/2
Kanana-2-30B-Thinking	60.0	60.0	1/1	1/1	1/1	0/1	1/1	1/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	0/1	0/1	1/1	0/1	1/1	0/1	1/1	0/1	0/1	1/1	1/1	0/1	1/1	0/1	1/1	0/1	0/1	1/1	0/1	1/1	1/1	1/1	1/1	1/1	1/1	1/1	0/1	1/1	1/1	1/1	0/1	0/1	0/1	0/1	0/1	0/1	1/1