Adversarial Math Word Problem Generation

Roy Xie, Chengxuan Huang, Junlin Wang, Bhuwan Dhingra

TL;DR

The paper addresses the fair assessment of students' math problem solving in the presence of powerful LLMs. It introduces an AST-based framework for generating adversarial math word problems (MWPs) that preserve the difficulty of the original problem while causing LLMs to answer incorrectly. Each MWP is mapped to Python code and then to an AST; educational constraints on the AST nodes guide the generation of variants through three generation modes of increasing restrictiveness. The approach is validated on multiple open- and closed-source models. The authors report substantial attack effectiveness, outperforming rephrasing baselines by an average of 62 attack success rate (ASR) points, analyze universal attacks and transferability across models, and show that cost-aware attacks can target high-cost models with far fewer API calls. Human evaluations corroborate coherence and difficulty preservation for the most restrictive method, while regression analyses reveal model-specific weaknesses. Overall, the work contributes a scalable, principled method for stress-testing LLM math reasoning and informs the design of fairer, more robust educational tools in contexts where AI assistance is prevalent.

Abstract

Large language models (LLMs) have significantly transformed the educational landscape. As current plagiarism detection tools struggle to keep pace with LLMs' rapid advancements, the educational community faces the challenge of assessing students' true problem-solving abilities in the presence of LLMs. In this work, we explore a new paradigm for ensuring fair evaluation -- generating adversarial examples which preserve the structure and difficulty of the original questions aimed for assessment, but are unsolvable by LLMs. Focusing on the domain of math word problems, we leverage abstract syntax trees to structurally generate adversarial examples that cause LLMs to produce incorrect answers by simply editing the numeric values in the problems. We conduct experiments on various open- and closed-source LLMs, quantitatively and qualitatively demonstrating that our method significantly degrades their math problem-solving ability. We identify shared vulnerabilities among LLMs and propose a cost-effective approach to attack high-cost models. Additionally, we conduct automatic analysis to investigate the cause of failure, providing further insights into the limitations of LLMs.
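
The idea of editing numeric values through an abstract syntax tree can be illustrated with a minimal sketch. It is not the authors' implementation: the example problem, the names `NumberCollector`, `NumberSwapper`, and `generate_variant`, and the two constraints (new values are positive integers at most twice the original, and the final answer stays a positive integer) are assumptions chosen for illustration. The sketch also assumes every literal in the problem code is a positive integer.

```python
import ast
import random

# Hypothetical MWP already expressed as Python code:
# "Ann has 5 boxes with 12 pens each and gives away 8 pens. How many are left?"
PROBLEM_CODE = """
boxes = 5
pens_per_box = 12
given_away = 8
answer = boxes * pens_per_box - given_away
"""

def is_number(node):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(node.value, (int, float)) and not isinstance(node.value, bool)

class NumberCollector(ast.NodeVisitor):
    """Collect numeric literals in depth-first source order."""
    def __init__(self):
        self.values = []

    def visit_Constant(self, node):
        if is_number(node):
            self.values.append(node.value)

class NumberSwapper(ast.NodeTransformer):
    """Replace numeric literals, in the same order, with supplied values."""
    def __init__(self, new_values):
        self.new_values = iter(new_values)

    def visit_Constant(self, node):
        if is_number(node):
            return ast.copy_location(ast.Constant(value=next(self.new_values)), node)
        return node

def evaluate(tree):
    """Execute the problem code and return its `answer` variable."""
    ast.fix_missing_locations(tree)
    namespace = {}
    exec(compile(tree, "<mwp>", "exec"), namespace)
    return namespace["answer"]

def generate_variant(source, rng, max_tries=100):
    """Sample new literal values until the illustrative constraints hold."""
    collector = NumberCollector()
    collector.visit(ast.parse(source))
    for _ in range(max_tries):
        # Assumed constraint: positive integers no larger than twice the original.
        candidate = [rng.randint(1, 2 * v) for v in collector.values]
        tree = NumberSwapper(candidate).visit(ast.parse(source))
        answer = evaluate(tree)
        # Assumed constraint: the answer remains a positive integer.
        if candidate != collector.values and isinstance(answer, int) and answer > 0:
            return candidate, answer
    return None

if __name__ == "__main__":
    print(generate_variant(PROBLEM_CODE, random.Random(0)))
```

Because the structure of the code is untouched, the variant requires the same sequence of operations as the original problem; only the numbers, and therefore the ground-truth answer, change.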

Paper Structure

This paper contains 48 sections, 2 equations, 8 figures, 8 tables, 2 algorithms.

Figures (8)

  • Figure 1: Method Overview: Given an MWP that an LLM can solve correctly, our method first transforms it into Python code. The Python code is then converted into an AST representation, which is used to generate adversarial problems by modifying the numeric values in a controllable manner. We place constraints on the nodes of the AST to ensure that the modified problem maintains the same difficulty level as the original problem. Despite this, we find that the resulting adversarial examples cause LLMs to predict incorrect answers.
  • Figure 2: Human Evaluation: The average score from three annotators. $M3$ achieves the highest scores across all metrics, indicating our best generation method correctly generates contextually coherent problems that preserve original difficulty.
  • Figure 3: Transferability: We present the adversarial example transferability (%) among all models by comparing each model against all other models. Compared to the math-tuned and production models, weaker models such as Llama-2 13B exhibit significant vulnerability and a strong correlation with one another.
  • Figure 4: A rephrasing attack example from Zhou et al. (2023). The left side shows the original problem, and the right side shows the rephrased version. Although planes and helicopters are conceptually similar, the maximum flying distance of a helicopter is usually between 300 and 400 miles. The rephrasing therefore introduces a subtle factual error into the problem, which can be hard to detect.
  • Figure 5: We present the distribution of incorrect adversarial examples in different buckets. For strong models (e.g., GPT-3.5-Turbo, MetaMath 70B), around half of the problems have zero incorrect adversarial examples, while for weaker models (e.g., Llama-2 13B, CodeLlama 34B), most problems have more than 90 incorrect adversarial examples.
  • ...and 3 more figures
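
The pairwise comparison behind a transferability table such as the one in Figure 3 can be sketched as follows. The definition used here is an assumption, not taken from the paper: transferability from a source model to a target model is the percentage of adversarial examples that fool the source and also fool the target. The function name, model names, and example ids are hypothetical.

```python
from typing import Dict, Set

def transferability(fooled: Dict[str, Set[str]]) -> Dict[str, Dict[str, float]]:
    """Compute pairwise transferability percentages.

    fooled[m] holds the ids of adversarial examples that model m answers
    incorrectly. table[src][tgt] is the share (in %) of src's failures
    that are also failures of tgt (assumed definition).
    """
    table: Dict[str, Dict[str, float]] = {}
    for src, src_ids in fooled.items():
        table[src] = {}
        for tgt, tgt_ids in fooled.items():
            # Skip the diagonal and sources with no failures to transfer.
            if src == tgt or not src_ids:
                continue
            table[src][tgt] = 100.0 * len(src_ids & tgt_ids) / len(src_ids)
    return table

if __name__ == "__main__":
    # Toy data with hypothetical models and example ids.
    toy = {
        "model_a": {"e1", "e2", "e3", "e4"},
        "model_b": {"e1", "e2"},
    }
    for src, row in transferability(toy).items():
        print(src, row)
```

Under this definition the table is not symmetric: a model that fails on many examples tends to receive transfers from every other model, while its own failures transfer less often to stronger models.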