EntropyMath Leaderboard

A high-entropy mathematical reasoning benchmark for LLMs

Derivatives from seed problems

Model Accuracy vs Pass@3

[Chart: Accuracy and Pass@3 per model, axis from 0% to 100%. Models shown: Deepseek-V3.2, Grok-4.1-fast, GPT-5.1 (high), Claude-Opus-4.5, Gemini-3-Pro-Preview.]

Avg Token Usage (Per Problem)

[Chart: Avg Tokens / Problem per model, axis from 0 to 21.1K. Models shown: Grok-4.1-fast, Gemini-3-Pro-Preview, Deepseek-V3.2, Claude-Opus-4.5, GPT-5.1 (high).]

EntropyMath is an evolutionary multi-agent system and benchmark that generates high-entropy math problems designed to systematically break current LLMs. The EntropyMath_Open_10 benchmark contains derivative problems generated from the seed set. By presenting variations on known problem distributions, it tests a model's robustness and generalization ability and helps ensure that performance is not merely the result of memorization.

Results are reported using Pass@3 metrics to account for generation variance, alongside detailed execution traces for transparency.

Performance Legend

Mastery (100%): 3/3
Strong (66%): 2/3
Weak (33%): 1/3
Fail (0%): 0/3
Model                 Acc    Pass@3  0    1    2    3    4    5    6    7    8    9
API / Others
Grok-4.1-fast         93.3   100.0   3/3  3/3  2/3  3/3  3/3  3/3  3/3  3/3  2/3  3/3
GPT-5.1 (high)        93.3   100.0   3/3  3/3  3/3  3/3  3/3  2/3  3/3  3/3  2/3  3/3
Claude-Opus-4.5       90.0   100.0   3/3  3/3  3/3  3/3  3/3  3/3  2/3  3/3  1/3  3/3
Gemini-3-Pro-Preview  90.0   100.0   3/3  2/3  2/3  3/3  3/3  3/3  2/3  3/3  3/3  3/3
Deepseek-V3.2         100.0  100.0   3/3  3/3  3/3  3/3  3/3  3/3  3/3  3/3  3/3  3/3
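
The Acc and Pass@3 columns appear consistent with two simple definitions, sketched below in Python. Both definitions are assumptions inferred from the reported numbers, not taken from the EntropyMath source: Acc is treated as the share of all attempts that were correct (three attempts on each of ten problems), and Pass@3 as the share of problems solved in at least one of the three attempts. The names RESULTS, accuracy, and pass_at_k are hypothetical and used for illustration only; the per-problem counts are copied from the leaderboard rows.

```python
# Correct attempts (out of 3) for problems 0..9, copied from the leaderboard rows.
RESULTS = {
    "Grok-4.1-fast":        [3, 3, 2, 3, 3, 3, 3, 3, 2, 3],
    "GPT-5.1 (high)":       [3, 3, 3, 3, 3, 2, 3, 3, 2, 3],
    "Claude-Opus-4.5":      [3, 3, 3, 3, 3, 3, 2, 3, 1, 3],
    "Gemini-3-Pro-Preview": [3, 2, 2, 3, 3, 3, 2, 3, 3, 3],
    "Deepseek-V3.2":        [3] * 10,
}
ATTEMPTS = 3  # attempts per problem, as implied by the x/3 cells


def accuracy(correct_counts, attempts=ATTEMPTS):
    """Percentage of all attempts that were correct (assumed definition)."""
    total_attempts = attempts * len(correct_counts)
    return round(100 * sum(correct_counts) / total_attempts, 1)


def pass_at_k(correct_counts):
    """Percentage of problems with at least one correct attempt (assumed definition)."""
    solved = sum(1 for c in correct_counts if c > 0)
    return round(100 * solved / len(correct_counts), 1)


if __name__ == "__main__":
    for model, counts in RESULTS.items():
        print(f"{model}: Acc={accuracy(counts)} Pass@3={pass_at_k(counts)}")
```

Under these assumptions, every model reaches 100.0 on Pass@3 because no problem has a 0/3 cell, while Acc separates the models by how often individual attempts fail.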