[Figure: Model Accuracy vs Pass@3. Chart on a 0% to 100% axis showing Accuracy and Pass@3 for Gemini-3-Pro-Preview, GPT-5.2 (high), Solar Pro 3 (Round 2), Kanana-2-30B-Thinking-2601, and K-EXAONE-236B-A23B.]

[Figure: Avg Token Usage (Per Problem). Chart on a 0 to 20.7K axis showing Avg Tokens / Problem for the same five models.]

EntropyMath is an evolutionary multi-agent system and benchmark that generates high-entropy math problems designed to systematically break current LLMs.
Results are reported using Pass@3 metrics to account for generation variance, alongside detailed execution traces for transparency.
Performance Legend: Mastery (100%) = 3/3, Strong (66%) = 2/3, Weak (33%) = 1/3, Fail (0%) = 0/3.

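As a small illustration of how the legend maps onto the table cells, the sketch below converts a cell such as `2/3` into its tier name. The function name `tier` and the choice to compare by ratio (so that a `1/1` cell counts as Mastery and a `0/1` cell as Fail) are assumptions made for this example, not something the legend itself states.

```python
from fractions import Fraction

# Tier names come from the legend; keys are the share of correct attempts.
# Treating single-attempt cells (1/1, 0/1) by the same ratio is an assumption.
LEGEND = {
    Fraction(3, 3): "Mastery",
    Fraction(2, 3): "Strong",
    Fraction(1, 3): "Weak",
    Fraction(0, 3): "Fail",
}


def tier(cell: str) -> str:
    """Map a cell like '2/3' (correct/attempts) to a legend tier name."""
    correct, attempts = (int(part) for part in cell.split("/"))
    ratio = Fraction(correct, attempts)
    if ratio not in LEGEND:
        raise ValueError(f"no legend tier for cell {cell!r}")
    return LEGEND[ratio]


if __name__ == "__main__":
    for cell in ("3/3", "2/3", "1/3", "0/3"):
        print(cell, tier(cell))
```
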
| Model | Acc (%) | Pass@3 (%) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| API / Others | | | | | | | | | | | | | | |
| Gemini-3-Pro-Preview | 100.0 | 100.0 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 |
| GPT-5.2 (high) | 83.3 | 83.3 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 0/1 | 1/1 | 1/1 | 0/1 | 1/1 |
| K-LLM Project Round 2 | | | | | | | | | | | | | | |
| Solar Pro 3 (Round 2) | 75.0 | 83.3 | 3/3 | 3/3 | 3/3 | 3/3 | 0/3 | 2/3 | 3/3 | 0/3 | 2/3 | 3/3 | 2/3 | 3/3 |
| K-EXAONE-236B-A23B | 58.3 | 75.0 | 3/3 | 2/3 | 3/3 | 3/3 | 1/3 | 0/3 | 3/3 | 0/3 | 3/3 | 0/3 | 1/3 | 2/3 |
| Local - KR | | | | | | | | | | | | | | |
| Kanana-2-30B-Thinking-2601 | 61.1 | 83.3 | 3/3 | 3/3 | 3/3 | 0/3 | 1/3 | 3/3 | 3/3 | 0/3 | 1/3 | 2/3 | 1/3 | 2/3 |
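
The two summary columns can be recomputed from the per-problem cells. The sketch below assumes that Acc is the share of all attempts that were correct and that Pass@3 is the share of problems with at least one correct attempt. That reading is inferred from the numbers in the table rather than stated by the source, and the helper names (`parse_cell`, `accuracy`, `pass_at_least_once`) are hypothetical.

```python
def parse_cell(cell: str) -> tuple[int, int]:
    """Split a cell like '2/3' into (correct, attempts)."""
    correct, attempts = cell.split("/")
    return int(correct), int(attempts)


def accuracy(cells: list[str]) -> float:
    """Percentage of all attempts that were correct (assumed meaning of Acc)."""
    parsed = [parse_cell(c) for c in cells]
    total_correct = sum(correct for correct, _ in parsed)
    total_attempts = sum(attempts for _, attempts in parsed)
    return 100 * total_correct / total_attempts


def pass_at_least_once(cells: list[str]) -> float:
    """Percentage of problems solved in at least one attempt (assumed Pass@3)."""
    solved = sum(1 for c in cells if parse_cell(c)[0] > 0)
    return 100 * solved / len(cells)


# Per-problem cells copied from the Solar Pro 3 (Round 2) row, problems 0 to 11.
solar = "3/3 3/3 3/3 3/3 0/3 2/3 3/3 0/3 2/3 3/3 2/3 3/3".split()

if __name__ == "__main__":
    print(round(accuracy(solar), 1), round(pass_at_least_once(solar), 1))
    # prints: 75.0 83.3
```

Under these assumptions the recomputed values match the reported 75.0 and 83.3 for that row. For the API models, each cell records a single attempt (`1/1` or `0/1`), so the two columns coincide.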


