
EntropyMath Leaderboard

A high-entropy mathematical reasoning benchmark for LLMs

Korean CSAT (College Scholastic Ability Test) Math Problems

Model Accuracy vs Pass@3

[Figure: chart of Accuracy and Pass@3 per model, on a 0% to 100% axis. Models, in chart order: Gemini-3-Pro-Preview, Claude-Opus-4.5, Grok-4.1-fast, GPT-5.1 (high), Deepseek-V3.2, Solar-Pro-2 (31B)(high), HCX-007(high), EXAONE-4.0.1-32B (high), A.X-4.0 (72B), Llama-VARCO-8B-Instruct.]

Avg Token Usage (Per Problem)

[Figure: chart of average tokens per problem for each model, on a 0 to 23.6K axis. Models, in chart order: Grok-4.1-fast, Deepseek-V3.2, Gemini-3-Pro-Preview, Claude-Opus-4.5, Solar-Pro-2 (31B)(high), GPT-5.1 (high), EXAONE-4.0.1-32B (high), Llama-VARCO-8B-Instruct, HCX-007(high), A.X-4.0 (72B).]

EntropyMath is an evolutionary multi-agent system and benchmark that generates high-entropy math problems designed to systematically break current LLMs. The EntropyMath_SAT_50 dataset consists of SAT-style problems derived from Korean, Indian, and Japanese college entrance examinations, including the Korean College Scholastic Ability Test (CSAT). These problems demand high-precision calculation as well as deep conceptual understanding and logical inference, and they remain difficult even for advanced LLMs.

Results are reported using Pass@3 metrics to account for generation variance, alongside detailed execution traces for transparency.
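
The page does not spell out how the two metrics are computed. A minimal sketch, assuming the common reading in which accuracy is the share of correct attempts and Pass@3 is the share of problems solved by at least one of three attempts (the function names and the sample data are illustrative, not taken from the benchmark):

```python
from typing import Sequence


def accuracy(attempts: Sequence[Sequence[bool]]) -> float:
    """Share of correct attempts over all attempts on all problems."""
    total = sum(len(a) for a in attempts)
    correct = sum(sum(a) for a in attempts)
    return correct / total


def pass_at_k(attempts: Sequence[Sequence[bool]], k: int = 3) -> float:
    """Share of problems with at least one correct answer in the first k attempts."""
    solved = sum(any(a[:k]) for a in attempts)
    return solved / len(attempts)


# Hypothetical results: four problems, three attempts each.
sample = [
    [True, True, True],
    [False, True, False],
    [False, False, False],
    [True, False, True],
]

print(accuracy(sample))   # 6 of 12 attempts correct
print(pass_at_k(sample))  # 3 of 4 problems solved at least once
```

Under this reading Pass@3 can never fall below accuracy, because a problem with any correct attempt counts in full.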

Performance Legend

Mastery (100%): 3/3
Strong (66%): 2/3
Weak (33%): 1/3
Fail (0%): 0/3
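
The legend amounts to a lookup from the number of correct attempts (out of three) to a label and a truncated percentage. A small sketch of that mapping, with `legend_entry` as a hypothetical helper name:

```python
# Labels as listed in the Performance Legend, keyed by correct attempts out of 3.
LEGEND = {3: "Mastery", 2: "Strong", 1: "Weak", 0: "Fail"}


def legend_entry(correct: int) -> str:
    """Return the legend label with its percentage, e.g. 'Strong (66%)'."""
    if correct not in LEGEND:
        raise ValueError("correct must be between 0 and 3")
    percent = 100 * correct // 3  # truncated, matching the 66% and 33% shown
    return f"{LEGEND[correct]} ({percent}%)"


for n in (3, 2, 1, 0):
    print(f"{n}/3 -> {legend_entry(n)}")
```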
Per-problem results (50 problems). Each digit is one problem, in the column order listed below: 1 = solved (shown as 1/1 on the page), 0 = not solved (shown as 0/1).

Column groups:
18-22: 18, 19, 20, 21, 22
PS-26-30: PS-26, PS-27, PS-28, PS-29, PS-30
Cal-26-30: Cal-26, Cal-27, Cal-28, Cal-29, Cal-30
Geo-26-30: Geo-26, Geo-27, Geo-28, Geo-29, Geo-30
KOR: Chung-Ang Univ, Dongguk Univ, Ewha Womans Univ, Hanyang Univ, Konkuk Univ, Korea Univ, Kyung Hee Univ, Sungkyunkwan Univ, Sogang Univ, Yonsei Univ
IND: IND-1 to IND-10
JPN: JPN-1 to JPN-10

API / Others

Model | ACC (%) | Pass@3 (%) | 18-22 | PS-26-30 | Cal-26-30 | Geo-26-30 | KOR | IND | JPN
Grok-4.1-fast | 82.0 | 82.0 | 11101 | 11010 | 11100 | 10111 | 1111011111 | 1111111110 | 1111111011
GPT-5.1 (high) | 80.0 | 80.0 | 11100 | 11011 | 11101 | 10101 | 1011101111 | 1111111110 | 1110111111
Claude-Opus-4.5 | 84.0 | 84.0 | 11100 | 11011 | 11100 | 10111 | 1111011111 | 1111111111 | 1111111110
Gemini-3-Pro-Preview | 92.0 | 92.0 | 11100 | 11011 | 11101 | 11111 | 1111111111 | 1111111111 | 1111111111
Deepseek-V3.2 | 76.0 | 76.0 | 11100 | 11011 | 11100 | 10101 | 1111011111 | 0011111110 | 1111111011

Sovereign 5 - KR

Model | ACC (%) | Pass@3 (%) | 18-22 | PS-26-30 | Cal-26-30 | Geo-26-30 | KOR | IND | JPN
Solar-Pro-2 (31B)(high) | 58.0 | 58.0 | 11100 | 11000 | 11100 | 10101 | 0011011111 | 0110110110 | 1110001010
EXAONE-4.0.1-32B (high) | 24.0 | 24.0 | 10000 | 10010 | 11000 | 00101 | 0000010000 | 1000010100 | 0010000000
HCX-007(high) | 26.0 | 26.0 | 11000 | 10010 | 00000 | 10100 | 0000000000 | 1000110110 | 0010100000
A.X-4.0 (72B) | 24.0 | 24.0 | 11000 | 11010 | 00000 | 10000 | 0000010000 | 0000100100 | 0100001010
Llama-VARCO-8B-Instruct | 2.0 | 2.0 | 10000 | 00000 | 00000 | 00000 | 0000000000 | 0000000000 | 0000000000
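
The ACC values appear consistent with the share of solved problems out of 50; for instance, 82.0 corresponds to 41 solved cells. A minimal sketch of that cross-check, assuming cells of the form solved/attempts (the helper name and the ordering of the example row are illustrative, not the actual row):

```python
def accuracy_percent(cells: list[str]) -> float:
    """Percentage of solved attempts, given cells such as '1/1' or '0/1'."""
    solved = 0
    attempts = 0
    for cell in cells:
        s, n = cell.split("/")
        solved += int(s)
        attempts += int(n)
    return round(100 * solved / attempts, 1)


# Illustrative row: 41 solved and 9 unsolved problems, order not meaningful.
row = ["1/1"] * 41 + ["0/1"] * 9
print(accuracy_percent(row))  # 82.0
```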