Adversarial Math Word Problem Generation
Roy Xie, Chengxuan Huang, Junlin Wang, Bhuwan Dhingra
TL;DR
The paper addresses the fair assessment of students' math problem solving in the presence of powerful LLMs. It introduces an adversarial math word problem (MWP) generation framework based on abstract syntax trees (ASTs) that preserves the original difficulty of a problem while rendering it unsolvable by LLMs. Each MWP is mapped to Python code and then to an AST, and educational constraints are applied to generate variants under three generation modes of increasing restrictiveness. The approach is validated on multiple open- and closed-source models. The authors report substantial attack effectiveness, outperforming rephrasing baselines by an average of 62 attack success rate (ASR) points. They also analyze the universality and transferability of the attacks and show that efficient, cost-aware attacks can target high-cost models with far fewer API calls. Human evaluations corroborate coherence and difficulty preservation for the most restrictive method, while regression analyses reveal model-specific weaknesses. Overall, the work contributes a scalable, principled method for stress-testing LLM math reasoning and informs the design of fairer, more robust educational tools in contexts where AI assistance is prevalent.
Abstract
Large language models (LLMs) have significantly transformed the educational landscape. As current plagiarism detection tools struggle to keep pace with LLMs' rapid advancements, the educational community faces the challenge of assessing students' true problem-solving abilities in the presence of LLMs. In this work, we explore a new paradigm for ensuring fair evaluation -- generating adversarial examples that preserve the structure and difficulty of the original questions intended for assessment, but are unsolvable by LLMs. Focusing on the domain of math word problems, we leverage abstract syntax trees to structurally generate adversarial examples that cause LLMs to produce incorrect answers by simply editing the numeric values in the problems. We conduct experiments on various open- and closed-source LLMs, quantitatively and qualitatively demonstrating that our method significantly degrades their math problem-solving ability. We identify shared vulnerabilities among LLMs and propose a cost-effective approach to attack high-cost models. Additionally, we conduct automatic analyses to investigate the causes of failure, providing further insights into the limitations of LLMs.
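To make the core idea concrete, the following is a minimal sketch of editing numeric values through an AST, written with Python's standard `ast` module. It is an illustration under stated assumptions, not the authors' implementation: the toy problem, the names `PROBLEM_CODE`, `NumberEditor`, `run`, and `generate_variant`, the sampling range, and the single constraint used here (the final answer must remain a positive integer) are all hypothetical stand-ins for the paper's educational constraints and generation modes.

```python
import ast
import random

# Hypothetical Python rendering of a toy word problem (an assumption for illustration):
# "Tom has 5 boxes with 12 apples each. He gives away 8 apples. How many are left?"
PROBLEM_CODE = """
boxes = 5
apples_per_box = 12
given_away = 8
answer = boxes * apples_per_box - given_away
"""


class NumberEditor(ast.NodeTransformer):
    """Replace every integer literal with a freshly sampled integer.

    The tree structure (operators, variable names) is left untouched,
    so only the numeric values of the problem change.
    """

    def __init__(self, rng, low, high):
        self.rng = rng
        self.low = low
        self.high = high

    def visit_Constant(self, node):
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(node.value, int) and not isinstance(node.value, bool):
            new_node = ast.Constant(value=self.rng.randint(self.low, self.high))
            return ast.copy_location(new_node, node)
        return node


def run(tree):
    """Execute a module AST and return the value bound to `answer`."""
    env = {}
    exec(compile(tree, "<mwp>", "exec"), env)
    return env["answer"]


def generate_variant(code, rng, low=1, high=1000, max_tries=100):
    """Sample new numbers until the assumed constraint holds.

    Assumed constraint (illustrative only): the answer is a positive integer.
    Returns (variant_source, answer), or None if no valid variant is found.
    """
    for _ in range(max_tries):
        tree = NumberEditor(rng, low, high).visit(ast.parse(code))
        ast.fix_missing_locations(tree)
        answer = run(tree)
        if isinstance(answer, int) and answer > 0:
            return ast.unparse(tree), answer  # ast.unparse needs Python 3.9+
    return None
```

Calling `generate_variant(PROBLEM_CODE, random.Random(0))` returns the source of a variant with the same computation structure but different numbers, together with its ground-truth answer. In a full pipeline, the new numbers would be substituted back into the problem text and the variant posed to an LLM.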
