Iterative LLM-Based Generation and Refinement of Distracting Conditions in Math Word Problems

Kaiqi Yang, Hang Li, Yucheng Chu, Zitao Liu, Mi Tian, Hui Liu

TL;DR

This paper addresses the challenge of distracting or irrelevant information in math word problems (MWPs) and how it impacts large language models (LLMs). It introduces IGC-MWP, an LLM-driven iterative framework that generates distracting conditions while ensuring the original solution remains unchanged, thereby reducing annotation effort and preserving data quality. The approach relies on a structured five-step prompt set and an automatic rejection mechanism to progressively refine problems through generation, quantitative/difficulty checks, and desirable/undesirable trait assessments. Experiments on GSM-8K show that IGC-MWP yields higher-quality distractors: problems revised with it cause larger performance drops than those produced by the baselines, and the distractors show improved realism and cognitive difficulty. The framework offers a scalable, deployable method for benchmarking and improving LLM reasoning in MWPs, with future work focusing on quantitative quality metrics and contrastive tuning to further boost mathematical reasoning capabilities.

Abstract

Mathematical reasoning serves as a crucial testbed for the intelligence of large language models (LLMs), and math word problems (MWPs) are a popular type of math problem. Most MWP datasets consist of problems containing only the necessary information, while problems with distracting and excessive conditions are often overlooked. Prior work has tested popular LLMs and found a dramatic performance drop in the presence of distracting conditions. However, datasets of MWPs with distracting conditions are limited, and most suffer from low difficulty and out-of-context expressions. This makes the distracting conditions easy to identify and exclude, which reduces the credibility of benchmarking on them. Moreover, adding distracting conditions may change the reasoning and the answers, requiring intensive labor to check and rewrite the solutions. To address these issues, we design an iterative framework to generate distracting conditions using LLMs. We develop a set of prompts to revise MWPs from different perspectives and cognitive levels, encouraging the generation of distracting conditions as well as suggestions for further revision. Another advantage is that original and revised problems share their solutions: we explicitly guide the LLMs to generate distracting conditions that do not alter the original solutions, thus avoiding the need to generate new ones. This framework is efficient and easy to deploy, reducing the overhead of generating MWPs with distracting conditions while maintaining data quality.
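
The generate, check, and revise loop described above can be pictured with a minimal sketch. This is not the authors' implementation: the names `refine`, `RefineResult`, `toy_generate`, and the two toy checks are hypothetical, the LLM calls are replaced by plain Python callables, and running every check in each round before regenerating is an assumption rather than something the paper states here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# a generator maps (original problem, feedback so far) -> revised problem;
# a check maps (original, revised) -> (passed, suggestion for revision).
Generator = Callable[[str, List[str]], str]
Check = Callable[[str, str], Tuple[bool, str]]


@dataclass
class RefineResult:
    problem: str                     # accepted revision, or the original if rejected
    accepted: bool
    rounds: int
    feedback_log: List[str] = field(default_factory=list)


def refine(original: str, generate: Generator, checks: List[Check],
           max_rounds: int = 3) -> RefineResult:
    """Regenerate until every check passes or the round budget is exhausted."""
    feedback: List[str] = []
    log: List[str] = []
    for round_idx in range(1, max_rounds + 1):
        candidate = generate(original, feedback)
        # Collect suggestions from every failing check for the next round.
        feedback = [note for passed, note in
                    (check(original, candidate) for check in checks)
                    if not passed]
        log.extend(feedback)
        if not feedback:
            return RefineResult(candidate, True, round_idx, log)
    # Automatic rejection: fall back to the unmodified problem.
    return RefineResult(original, False, max_rounds, log)


# Toy stand-ins for the LLM prompts, for illustration only.
def toy_generate(original: str, feedback: List[str]) -> str:
    if not feedback:
        return original + " She also likes the color blue."
    return original + " Her neighbor owns 7 chickens."


def check_original_preserved(original: str, candidate: str) -> Tuple[bool, str]:
    return candidate.startswith(original), "Keep the original conditions unchanged."


def check_numeric_distractor(original: str, candidate: str) -> Tuple[bool, str]:
    added = candidate[len(original):]
    return any(ch.isdigit() for ch in added), "Add a numeric distracting condition."
```

In this toy run the first candidate fails the numeric check, its suggestion is fed back to the generator, and the second candidate is accepted. Because the original conditions are kept verbatim, the original solution can be reused without re-annotation, which is the property the abstract emphasizes.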

Paper Structure

This paper contains 16 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Framework of our irrelevant condition (IC) generation. The yellow rectangles indicate the judgment mechanisms, while the green folders represent the prompts along with their generation and revision mechanisms. Dashed arrows denote the flow of judgment outputs (Yes: $\dashrightarrow$ and No: $\dashrightarrow$), while solid arrows denote the transmission of textual content.
  • Figure 2: Example of Step 1 prompts: Initial Generation. The added irrelevant conditions are underlined.
  • Figure 3: Example of Step 2 prompts: Quantitative Relationship Checking.
  • Figure 4: Example of Step 3 prompts: Difficulty Level Checking.
  • Figure 5: Example of Step 4 prompts: Desirable Characteristics Checking.
  • ...and 3 more figures