Korean CSAT 2025 (KOR)

Chart: Model Accuracy vs Pass@3 (%) for GPT-5.2 (high), Gemini-3-Pro-Preview, K-EXAONE-236B-A23B, EXAONE-4.0.1-32B (high), and Kanana-2-30B-Thinking-2601.
Chart: Avg Token Usage (Per Problem), in average tokens per problem, for the same five models.
EntropyMath is an evolutionary multi-agent system and benchmark that generates high-entropy math problems designed to systematically break current LLMs. The KOR_CSAT_25_KOR dataset contains the math problems from the 2025 Korean College Scholastic Ability Test (CSAT), a highly challenging benchmark for verifying mathematical reasoning in Korean.
Results are reported using the Pass@3 metric to account for generation variance, alongside detailed execution traces for transparency.
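
As a minimal sketch of how the Accuracy and Pass@3 columns can be reproduced from the per-problem correct-out-of-attempts counts in the results table below (the function names and data layout here are illustrative, not part of the benchmark code):

```python
def accuracy(results: list[tuple[int, int]]) -> float:
    """results: one (correct, attempts) pair per problem; overall % of correct runs."""
    correct = sum(c for c, _ in results)
    attempts = sum(n for _, n in results)
    return 100.0 * correct / attempts

def pass_at_n(results: list[tuple[int, int]]) -> float:
    """Percentage of problems solved at least once across the attempts."""
    solved = sum(1 for c, _ in results if c > 0)
    return 100.0 * solved / len(results)

# K-EXAONE-236B-A23B row from the table below (correct out of 3 runs, problems 0-12).
k_exaone = [(1, 3), (0, 3), (3, 3), (1, 3), (2, 3), (3, 3), (3, 3),
            (2, 3), (3, 3), (0, 3), (2, 3), (3, 3), (3, 3)]

print(f"Accuracy: {accuracy(k_exaone):.1f}")   # 66.7
print(f"Pass@3:   {pass_at_n(k_exaone):.1f}")  # 84.6
```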
Performance Legend

| Tier | Correct runs |
|---|---|
| Mastery (100%) | 3/3 |
| Strong (66%) | 2/3 |
| Weak (33%) | 1/3 |
| Fail (0%) | 0/3 |
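
A sketch of how a per-problem cell maps onto these tiers (the helper is hypothetical and the thresholds are assumed from the legend percentages):

```python
# Hypothetical helper: map a "correct out of attempts" cell to a legend tier.
def tier(correct: int, attempts: int = 3) -> str:
    rate = correct / attempts
    if rate >= 1.0:
        return "Mastery"  # 3/3
    if rate >= 2 / 3:
        return "Strong"   # 2/3
    if rate >= 1 / 3:
        return "Weak"     # 1/3
    return "Fail"         # 0/3

print(tier(2, 3))  # -> Strong
```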
| Model | Acc (%) | Pass@3 (%) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **API / Others** | | | | | | | | | | | | | | | | |
| GPT-5.2 (high) | 100.0 | 100.0 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 |
| Gemini-3-Pro-Preview | 100.0 | 100.0 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 |
| **K-LLM Project Round 2** | | | | | | | | | | | | | | | | |
| K-EXAONE-236B-A23B | 66.7 | 84.6 | 1/3 | 0/3 | 3/3 | 1/3 | 2/3 | 3/3 | 3/3 | 2/3 | 3/3 | 0/3 | 2/3 | 3/3 | 3/3 |
| **K-LLM Project Round 1** | | | | | | | | | | | | | | | | |
| EXAONE-4.0.1-32B (high) | 53.8 | 76.9 | 1/3 | 0/3 | 3/3 | 2/3 | 1/3 | 1/3 | 3/3 | 3/3 | 1/3 | 0/3 | 0/3 | 3/3 | 3/3 |
| **Local - KR** | | | | | | | | | | | | | | | | |
| Kanana-2-30B-Thinking-2601 | 53.8 | 69.2 | 0/3 | 0/3 | 3/3 | 0/3 | 3/3 | 2/3 | 3/3 | 3/3 | 3/3 | 0/3 | 1/3 | 2/3 | 1/3 |


