EvolMathEval: Towards Evolvable Benchmarks for Mathematical Reasoning via Evolutionary Testing
Shengbo Wang, Mingwei Liu, Zike Li, Anji Li, Yanlin Wang, Xin Peng, Zibin Zheng
TL;DR
The paper tackles the validity crisis of static mathematical benchmarks by introducing EvolMathEval, an automated framework that uses evolutionary testing to continuously generate and evolve challenging math problems. Built from seed problem initialization, genetic operators, a crossover mechanism, and a data-driven fitness function, EvolMathEval can create from-scratch problems and harden existing datasets such as GSM8K. Key findings include substantial reductions in model accuracy on evolved benchmarks (average ~48%, up to 95% in some cases) and the identification of a prevalent failure mode termed the Pseudo Aha Moment, where models shortcut reasoning rather than performing deep multi-step deduction. The framework generalizes to multiple public datasets, enhances model discriminability, and provides a path toward more robust evaluation and targeted improvement of mathematical reasoning in LLMs.
Abstract
The rapid advancement of Large Language Models (LLMs) poses a significant challenge to existing mathematical reasoning benchmarks. These benchmarks tend to become easier over time because LLMs can learn from the published problems, which hinders precise evaluation of the true capabilities of state-of-the-art (SOTA) models. To address this challenge, this paper introduces EvolMathEval, an automated mathematical benchmark generation and evolution framework based on evolutionary testing. Experimental results demonstrate that EvolMathEval can not only generate a large volume of high-difficulty problems through continuous self-iteration, but also significantly increase the complexity of public datasets such as GSM8K through evolution, reducing model accuracy by an average of 48%. Deeper investigation reveals that when solving these evolved problems, LLMs tend to bypass complex multi-step logical reasoning by relying on simplistic and fuzzy conditions, which leads to incorrect solutions. We define this phenomenon as the "Pseudo Aha Moment", which we find accounts for 77% to 100% of errors on targeted problems. Code and resources are available at: https://anonymous.4open.science/r/EvolMathEval
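To make the idea of evolutionary testing concrete, the following is a minimal, generic sketch of an evolve-and-select loop over toy arithmetic problems. It is not the EvolMathEval implementation: the `Problem` representation, the `mutate` and `crossover` operators, and the step-count `fitness` are all hypothetical stand-ins chosen for illustration, whereas the paper describes its own genetic operators and a data-driven fitness function.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    """Toy problem (hypothetical): a chain of arithmetic steps on a start value."""
    start: int
    steps: tuple  # tuple of (op, operand) pairs, with op in {"+", "*"}


def answer(p):
    """Compute the ground-truth answer by applying every step in order."""
    value = p.start
    for op, k in p.steps:
        value = value + k if op == "+" else value * k
    return value


def mutate(p, rng):
    """Assumed mutation operator: append one extra reasoning step."""
    step = (rng.choice("+*"), rng.randint(2, 9))
    return Problem(p.start, p.steps + (step,))


def crossover(a, b, rng):
    """Assumed crossover operator: a prefix of a's steps plus a suffix of b's."""
    cut_a = rng.randint(0, len(a.steps))
    cut_b = rng.randint(0, len(b.steps))
    return Problem(a.start, a.steps[:cut_a] + b.steps[cut_b:])


def fitness(p):
    """Placeholder difficulty proxy: the number of reasoning steps."""
    return len(p.steps)


def evolve(seeds, generations, population_size, seed=0):
    """Run the loop; needs at least two seeds and population_size >= 2."""
    rng = random.Random(seed)
    population = list(seeds)
    for _ in range(generations):
        children = []
        for _ in range(population_size):
            a, b = rng.sample(population, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        # Elitist selection: keep the fittest among parents and children.
        ranked = sorted(population + children, key=fitness, reverse=True)
        population = ranked[:population_size]
    return population


seeds = [Problem(1, (("+", 2),)), Problem(3, (("*", 2),))]
evolved = evolve(seeds, generations=5, population_size=4)
```

Because selection here is elitist, the best fitness in the population never decreases from one generation to the next. In a real benchmark setting the step-count proxy would be replaced by a measure tied to observed model performance, which is the role the paper assigns to its data-driven fitness function.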
