
EntropyMath Leaderboard

A high-entropy mathematical reasoning benchmark for LLMs

Korean CSAT (College Scholastic Ability Test) Math Problems

Model Accuracy vs Pass@3

[Chart: Accuracy and Pass@3 per model; y-axis from 0% to 100%. Models on the x-axis: Gemini-3-Pro-Preview, GPT-5.2 (high), Claude-Opus-4.5, Grok-4.1-fast, GPT-5.1 (high), Deepseek-V3.2, Solar-Open-100B, K-EXAONE-236B-A23B (two entries), Kanana-2-30B-Thinking-2601, Solar-Pro-2 (31B)(high), Kanana-2-30B-Thinking, HCX-007(high), EXAONE-4.0.1-32B (high), A.X-4.0 (72B), axk1, Llama-VARCO-8B-Instruct.]

Avg Token Usage (Per Problem)

[Chart: Avg Tokens / Problem per model; y-axis from 0 to 112K. Models on the x-axis: K-EXAONE-236B-A23B (two entries), Grok-4.1-fast, Solar-Open-100B, Kanana-2-30B-Thinking-2601, Kanana-2-30B-Thinking, Deepseek-V3.2, Gemini-3-Pro-Preview, Claude-Opus-4.5, Solar-Pro-2 (31B)(high), GPT-5.1 (high), EXAONE-4.0.1-32B (high), GPT-5.2 (high), Llama-VARCO-8B-Instruct, HCX-007(high), A.X-4.0 (72B), axk1.]

EntropyMath is an evolutionary multi-agent system and benchmark that generates high-entropy math problems designed to systematically break current LLMs. The EntropyMath_SAT_50 dataset challenges models with SAT-style problems derived from the Korean, Indian, and Japanese College Scholastic Ability Tests (CSAT). These problems demand both high-precision calculation and deep conceptual understanding and logical inference, which makes them a significant challenge even for advanced LLMs.

Results are reported using Pass@3 metrics to account for generation variance, alongside detailed execution traces for transparency.
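To make the two headline numbers concrete, here is a minimal sketch of how ACC and Pass@3 could be computed from per-problem cells of the form "correct/total". It assumes ACC is the mean fraction of correct runs across problems and Pass@3 is the share of problems solved in at least one run. The function name `summarize` and the input format are illustrative, not part of the benchmark's published tooling.

```python
from fractions import Fraction


def summarize(cells):
    """Return (ACC, Pass@k) in percent for a list of 'correct/total' strings.

    Assumption: ACC averages the per-problem fraction of correct runs,
    and Pass@k counts a problem as solved if any run was correct.
    """
    parsed = [tuple(int(x) for x in cell.split("/")) for cell in cells]
    # Mean per-problem accuracy, kept exact until the final rounding.
    acc = sum(Fraction(c, t) for c, t in parsed) / len(parsed)
    # Share of problems with at least one correct run.
    pass_k = Fraction(sum(1 for c, _ in parsed if c > 0), len(parsed))
    return round(float(acc) * 100, 1), round(float(pass_k) * 100, 1)


# Toy example with four problems and three runs each.
print(summarize(["3/3", "2/3", "0/3", "1/3"]))  # (50.0, 75.0)
```

Under this reading, a model evaluated with a single run per problem has identical ACC and Pass@3, while a multi-run row can show ACC below Pass@3, as the three-run K-EXAONE-236B-A23B row does (71.3 vs 88.0).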

Performance Legend

Mastery (100%): 3/3
Strong (66%): 2/3
Weak (33%): 1/3
Fail (0%): 0/3
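The legend can be read as a simple lookup from the number of correct runs (out of three) to a label. The sketch below is only an illustration of that mapping; the names `LEGEND` and `legend_label` are hypothetical, and the labels are copied from the legend as shown.

```python
# Labels as they appear in the Performance Legend, keyed by correct runs out of 3.
LEGEND = {
    3: "Mastery (100%)",
    2: "Strong (66%)",
    1: "Weak (33%)",
    0: "Fail (0%)",
}


def legend_label(cell: str) -> str:
    """Map a 'correct/3' cell such as '2/3' to its legend label."""
    correct, total = (int(x) for x in cell.split("/"))
    if total != 3:
        # The legend only defines labels for three-run cells.
        raise ValueError("legend is defined for 3-run cells only")
    return LEGEND[correct]


print(legend_label("2/3"))  # Strong (66%)
```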
Each per-problem cell shows correct runs / total runs. Within each group column, problems appear in this order:
18-22: 18, 19, 20, 21, 22
PS-26-30: PS-26, PS-27, PS-28, PS-29, PS-30
Cal-26-30: Cal-26, Cal-27, Cal-28, Cal-29, Cal-30
Geo-26-30: Geo-26, Geo-27, Geo-28, Geo-29, Geo-30
KOR: Chung-Ang Univ, Dongguk Univ, Ewha Womans Univ, Hanyang Univ, Konkuk Univ, Korea Univ, Kyung Hee Univ, Sungkyunkwan Univ, Sogang Univ, Yonsei Univ
IND: IND-1 through IND-10
JPN: JPN-1 through JPN-10

| Group | Model | ACC | PASS@3 | 18-22 | PS-26-30 | Cal-26-30 | Geo-26-30 | KOR | IND | JPN |
|---|---|---|---|---|---|---|---|---|---|---|
| API / Others | Gemini-3-Pro-Preview | 96.0 | 96.0 | 1/1 1/1 1/1 1/1 1/1 | 1/1 1/1 0/1 1/1 1/1 | 1/1 1/1 1/1 0/1 1/1 | 1/1 1/1 1/1 1/1 1/1 | 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 | 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 | 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 |
| API / Others | GPT-5.2 (high) | 86.0 | 86.0 | 1/1 1/1 1/1 1/1 1/1 | 1/1 1/1 0/1 1/1 1/1 | 1/1 1/1 1/1 0/1 1/1 | 1/1 0/1 1/1 0/1 1/1 | 1/1 0/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 | 1/1 0/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 0/1 | 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 |
| API / Others | Claude-Opus-4.5 | 86.0 | 86.0 | 1/1 1/1 1/1 0/1 1/1 | 1/1 1/1 0/1 1/1 1/1 | 1/1 1/1 1/1 0/1 0/1 | 1/1 0/1 1/1 1/1 1/1 | 1/1 1/1 1/1 1/1 0/1 1/1 1/1 1/1 1/1 1/1 | 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 | 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 0/1 |
| API / Others | Grok-4.1-fast | 82.0 | 82.0 | 1/1 1/1 1/1 0/1 1/1 | 1/1 1/1 0/1 1/1 0/1 | 1/1 1/1 1/1 0/1 0/1 | 1/1 0/1 1/1 1/1 1/1 | 1/1 1/1 1/1 1/1 0/1 1/1 1/1 1/1 1/1 1/1 | 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 0/1 | 1/1 1/1 1/1 1/1 1/1 1/1 1/1 0/1 1/1 1/1 |
| API / Others | GPT-5.1 (high) | 82.0 | 82.0 | 1/1 1/1 1/1 0/1 1/1 | 1/1 1/1 0/1 1/1 1/1 | 1/1 1/1 1/1 0/1 1/1 | 1/1 0/1 1/1 0/1 1/1 | 1/1 0/1 1/1 1/1 1/1 0/1 1/1 1/1 1/1 1/1 | 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 0/1 | 1/1 1/1 1/1 0/1 1/1 1/1 1/1 1/1 1/1 1/1 |
| API / Others | Deepseek-V3.2 | 78.0 | 78.0 | 1/1 1/1 1/1 0/1 1/1 | 1/1 1/1 0/1 1/1 1/1 | 1/1 1/1 1/1 0/1 0/1 | 1/1 0/1 1/1 0/1 1/1 | 1/1 1/1 1/1 1/1 0/1 1/1 1/1 1/1 1/1 1/1 | 0/1 0/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 0/1 | 1/1 1/1 1/1 1/1 1/1 1/1 1/1 0/1 1/1 1/1 |
| K-LLM Project Round 2 | K-EXAONE-236B-A23B | 71.3 | 88.0 | 3/3 3/3 1/3 3/3 2/3 | 3/3 1/3 0/3 3/3 0/3 | 3/3 3/3 3/3 0/3 1/3 | 3/3 1/3 2/3 1/3 1/3 | 2/3 2/3 2/3 3/3 0/3 3/3 3/3 3/3 2/3 1/3 | 3/3 0/3 3/3 3/3 3/3 3/3 3/3 2/3 3/3 0/3 | 3/3 3/3 3/3 3/3 2/3 3/3 3/3 2/3 3/3 1/3 |
| K-LLM Project Round 2 | Solar-Open-100B | 74.0 | 74.0 | 1/1 1/1 1/1 0/1 0/1 | 1/1 1/1 0/1 1/1 1/1 | 1/1 1/1 1/1 0/1 0/1 | 1/1 0/1 1/1 1/1 1/1 | 1/1 1/1 1/1 1/1 0/1 1/1 1/1 1/1 1/1 1/1 | 1/1 0/1 1/1 1/1 0/1 1/1 1/1 1/1 1/1 1/1 | 1/1 1/1 1/1 0/1 1/1 0/1 1/1 0/1 1/1 0/1 |
| K-LLM Project Round 2 | K-EXAONE-236B-A23B | 70.0 | 70.0 | 1/1 1/1 1/1 0/1 0/1 | 1/1 1/1 0/1 1/1 1/1 | 0/1 1/1 0/1 0/1 0/1 | 1/1 0/1 1/1 1/1 1/1 | 1/1 1/1 1/1 1/1 0/1 1/1 0/1 1/1 1/1 1/1 | 1/1 0/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 0/1 | 1/1 0/1 1/1 1/1 1/1 1/1 1/1 0/1 1/1 0/1 |
| K-LLM Project Round 1 | Solar-Pro-2 (31B)(high) | 60.0 | 60.0 | 1/1 1/1 1/1 0/1 1/1 | 1/1 1/1 0/1 0/1 0/1 | 1/1 1/1 1/1 0/1 0/1 | 1/1 0/1 1/1 0/1 1/1 | 0/1 0/1 1/1 1/1 0/1 1/1 1/1 1/1 1/1 1/1 | 0/1 1/1 1/1 0/1 1/1 1/1 0/1 1/1 1/1 0/1 | 1/1 1/1 1/1 0/1 0/1 0/1 1/1 0/1 1/1 0/1 |
| K-LLM Project Round 1 | HCX-007(high) | 26.0 | 26.0 | 1/1 1/1 0/1 0/1 0/1 | 1/1 0/1 0/1 1/1 0/1 | 0/1 0/1 0/1 0/1 0/1 | 1/1 0/1 1/1 0/1 0/1 | 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 | 1/1 0/1 0/1 0/1 1/1 1/1 0/1 1/1 1/1 0/1 | 0/1 0/1 1/1 0/1 1/1 0/1 0/1 0/1 0/1 0/1 |
| K-LLM Project Round 1 | EXAONE-4.0.1-32B (high) | 24.0 | 24.0 | 1/1 0/1 0/1 0/1 0/1 | 1/1 0/1 0/1 1/1 0/1 | 1/1 1/1 0/1 0/1 0/1 | 0/1 0/1 1/1 0/1 1/1 | 0/1 0/1 0/1 0/1 0/1 1/1 0/1 0/1 0/1 0/1 | 1/1 0/1 0/1 0/1 0/1 1/1 0/1 1/1 0/1 0/1 | 0/1 0/1 1/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 |
| K-LLM Project Round 1 | A.X-4.0 (72B) | 24.0 | 24.0 | 1/1 1/1 0/1 0/1 0/1 | 1/1 1/1 0/1 1/1 0/1 | 0/1 0/1 0/1 0/1 0/1 | 1/1 0/1 0/1 0/1 0/1 | 0/1 0/1 0/1 0/1 0/1 1/1 0/1 0/1 0/1 0/1 | 0/1 0/1 0/1 0/1 1/1 0/1 0/1 1/1 0/1 0/1 | 0/1 1/1 0/1 0/1 0/1 0/1 1/1 0/1 1/1 0/1 |
| K-LLM Project Round 1 | Llama-VARCO-8B-Instruct | 2.0 | 2.0 | 1/1 0/1 0/1 0/1 0/1 | 0/1 0/1 0/1 0/1 0/1 | 0/1 0/1 0/1 0/1 0/1 | 0/1 0/1 0/1 0/1 0/1 | 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 | 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 | 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 |
| Local - KR | Kanana-2-30B-Thinking-2601 | 65.0 | 80.0 | 2/2 2/2 1/2 1/2 2/2 | 2/2 2/2 0/2 2/2 1/2 | 2/2 2/2 2/2 0/2 0/2 | 1/2 0/2 1/2 0/2 1/2 | 1/2 2/2 2/2 1/2 0/2 2/2 0/2 2/2 2/2 0/2 | 2/2 1/2 2/2 2/2 2/2 1/2 1/2 2/2 2/2 0/2 | 2/2 2/2 2/2 1/2 1/2 0/2 2/2 2/2 1/2 1/2 |
| Local - KR | Kanana-2-30B-Thinking | 60.0 | 60.0 | 1/1 1/1 1/1 0/1 1/1 | 1/1 1/1 0/1 1/1 1/1 | 1/1 1/1 1/1 0/1 0/1 | 1/1 0/1 1/1 0/1 1/1 | 0/1 0/1 1/1 1/1 0/1 1/1 0/1 1/1 0/1 0/1 | 1/1 0/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 0/1 | 1/1 1/1 1/1 0/1 0/1 0/1 0/1 0/1 0/1 1/1 |