Korean CSAT (College Scholastic Ability Test) Math Problems
[Figure: Model Accuracy vs Pass@3. Bar chart (y-axis 0-100%) comparing Accuracy and Pass@3 for Gemini-3-Pro-Preview, Claude-Opus-4.5, Grok-4.1-fast, GPT-5.1 (high), Deepseek-V3.2, Solar-Pro-2 (31B)(high), HCX-007(high), EXAONE-4.0.1-32B (high), A.X-4.0 (72B), and Llama-VARCO-8B-Instruct.]
[Figure: Avg Token Usage (Per Problem). Bar chart (y-axis Avg Tokens / Problem, 0-23.6K) for Grok-4.1-fast, Deepseek-V3.2, Gemini-3-Pro-Preview, Claude-Opus-4.5, Solar-Pro-2 (31B)(high), GPT-5.1 (high), EXAONE-4.0.1-32B (high), Llama-VARCO-8B-Instruct, HCX-007(high), and A.X-4.0 (72B).]
EntropyMath is an evolutionary multi-agent system and benchmark that generates high-entropy math problems designed to systematically break current LLMs. The EntropyMath_SAT_50 dataset challenges models with SAT-style problems derived from the Korean College Scholastic Ability Test (CSAT) and comparable university entrance exams from India and Japan. These problems demand not only high-precision calculation but also deep conceptual understanding and logical inference, and they remain a significant challenge even for advanced LLMs.
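To make the idea of an evolutionary problem-generation loop concrete, here is a minimal sketch of one plausible shape for such a system: a generator agent mutates seed problems, a solver agent attempts them, and only problems the solver fails are kept as "hard" survivors. All names (mutate_problem, solve, verify, evolve) and the loop structure are illustrative assumptions, not the actual EntropyMath implementation.

```python
# Illustrative sketch of an evolutionary problem-generation loop.
# The agent interfaces (mutate_problem, solve, verify) are hypothetical
# stand-ins; EntropyMath's real pipeline is not described in this section.
import random
from dataclasses import dataclass

@dataclass
class Problem:
    statement: str
    answer: str

def mutate_problem(seed: Problem) -> Problem:
    """Generator agent: rewrite a seed into a harder variant (stubbed here)."""
    return Problem(statement=seed.statement + " (harder variant)", answer=seed.answer)

def solve(problem: Problem) -> str:
    """Solver agent: an LLM's attempted answer (stubbed with a random guess)."""
    return random.choice([problem.answer, "wrong"])

def verify(problem: Problem, attempt: str) -> bool:
    """Checker: exact-match grading against the reference answer."""
    return attempt.strip() == problem.answer.strip()

def evolve(seeds: list[Problem], generations: int = 3) -> list[Problem]:
    """Keep only mutated problems the solver fails, i.e. high-difficulty survivors."""
    survivors: list[Problem] = []
    population = list(seeds)
    for _ in range(generations):
        candidates = [mutate_problem(p) for p in population]
        hard = [p for p in candidates if not verify(p, solve(p))]
        survivors.extend(hard)
        population = hard or population  # fall back if every candidate was solved
    return survivors

if __name__ == "__main__":
    seeds = [Problem("Find the sum of all integer solutions of |x| < 3.", "0")]
    print(len(evolve(seeds)), "hard problems retained")
```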
Results are reported using the Pass@3 metric to account for generation variance, alongside detailed execution traces for transparency.
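As a concrete illustration of how the two numbers in the first chart could be derived from three independent attempts per problem, the snippet below computes a per-attempt Accuracy and a Pass@3 rate. Treating Accuracy as the mean over all individual attempts (rather than, say, a majority vote) is an assumption on my part, not a definition given in the source.

```python
# Hypothetical scoring sketch: three boolean attempt outcomes per problem.
# Assumes Accuracy = fraction of correct attempts and Pass@3 = at least one
# correct attempt per problem; the benchmark's exact definitions may differ.
def score(results: list[list[bool]]) -> tuple[float, float]:
    """results[i] holds the 3 attempt outcomes for problem i."""
    total_attempts = sum(len(r) for r in results)
    accuracy = sum(sum(r) for r in results) / total_attempts
    pass_at_3 = sum(1 for r in results if any(r)) / len(results)
    return accuracy, pass_at_3

# Example: 3 problems, 3 attempts each.
attempts = [[True, True, False], [False, False, False], [True, True, True]]
acc, p3 = score(attempts)
print(f"Accuracy={acc:.2f}, Pass@3={p3:.2f}")  # Accuracy=0.56, Pass@3=0.67
```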
Performance Legend
Mastery (100%): 3/3 correct
Strong (66%): 2/3 correct
Weak (33%): 1/3 correct
Fail (0%): 0/3 correct
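A minimal sketch of how these per-problem buckets could be assigned from the number of correct attempts out of three; the label strings mirror the legend above, and the function name is hypothetical.

```python
# Hypothetical mapping from correct attempts (out of 3) to the legend buckets.
def performance_bucket(correct_attempts: int) -> str:
    labels = {3: "Mastery (100%)", 2: "Strong (66%)", 1: "Weak (33%)", 0: "Fail (0%)"}
    return labels[correct_attempts]

print(performance_bucket(2))  # Strong (66%)
```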


