Can LLMs Solve longer Math Word Problems Better?

Xin Xu; Tong Xiao; Zitong Chao; Zhenya Huang; Can Yang; Yang Wang

TL;DR

The paper studies Context Length Generalizability (CoLeG) in math word problems by introducing Extended Grade-School Math (E-GSM) and two metrics, CoLeG-E and CoLeG-R. It shows that longer narratives impair LLM math reasoning and proposes two remedies: Condition-Retrieving Instruction (CoRe) for proprietary LLMs and an extension-based auxiliary fine-tuning task for open-source LLMs. Both improve CoLeG and robustness. Experiments across multiple proprietary and open-source models, plus generalization to MAWPS, SVAMP, and GSM-IC, demonstrate the effectiveness and generality of the proposed approaches. The work provides practical methods to enhance model generalizability in mathematical reasoning and offers a data-driven framework for evaluating reasoning under extended contexts.

Abstract

Math Word Problems (MWPs) play a vital role in assessing the capabilities of Large Language Models (LLMs), yet current research primarily focuses on questions with concise contexts. The impact of longer contexts on mathematical reasoning remains under-explored. This study pioneers the investigation of Context Length Generalizability (CoLeG), which refers to the ability of LLMs to solve MWPs with extended narratives. We introduce Extended Grade-School Math (E-GSM), a collection of MWPs featuring lengthy narratives, and propose two novel metrics to evaluate the efficacy and resilience of LLMs in tackling these problems. Our analysis of existing zero-shot prompting techniques with proprietary LLMs, along with open-source LLMs, reveals a general deficiency in CoLeG. To alleviate these issues, we propose tailored approaches for different categories of LLMs. For proprietary LLMs, we introduce a new instructional prompt designed to mitigate the impact of long contexts. For open-source LLMs, we develop a novel auxiliary task for fine-tuning to enhance CoLeG. Our comprehensive results demonstrate the effectiveness of our proposed methods, showing improved performance on E-GSM. Additionally, we conduct an in-depth analysis to differentiate the effects of semantic understanding and reasoning efficacy, showing that our methods improve the latter. We also establish the generalizability of our methods across several other MWP benchmarks. Our findings highlight the limitations of current LLMs and offer corresponding practical solutions, paving the way for further exploration of model generalizability and training methodologies.

Paper Structure (37 sections, 9 equations, 11 figures, 16 tables)

Introduction
The E-GSM Dataset
LLMs Struggle to Answer Math Word Problems with Longer Context
Dataset Creation and Quality Control
Evaluation Metrics on E-GSM
Methodology
Condition-Retrieving Instruction for Proprietary LLMs
Extension as an Auxiliary Task for Open-source LLMs
Results and Analysis
Experimental Setup
Main Results
Fine-grained Analysis on Semantic Understanding and Math Reasoning
Extension with Specialized Mathematical LLMs
Generalization to Other Benchmarks
Related Work
...and 22 more sections

Figures (11)

Figure 1: The visual comparison suggests that the number of tokens in $G_0$ is larger than in $G_1$, and a Mann-Whitney U test indicates that the difference is significant. This implies that LLMs, like humans, struggle to solve longer MWPs.
Figure 2: E-GSM creation process and prompt template for extension.
Figure 3: A comparison between solving a long problem (shortened version) with 0-CoT and CoRe.
Figure 4: Informativeness and missing step values of 4 representative LLMs.
Figure 5: Left: $\text{Acc}_i$ varying over rounds in E-GSM of LLaMA-2 and MetaMath families. "w" and "w/o" stand for "with" and "without" respectively. Right: Accuracy on GSM8K with short, medium, and long length. The range of tokens within each category is in parenthesis.
...and 6 more figures
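
The Figure 1 caption relies on a Mann-Whitney U test to compare token counts between two groups of problems. The following is a minimal sketch of such a one-sided test, not the authors' analysis code. It assumes that $G_0$ and $G_1$ are two groups of problems whose token counts are compared (the caption does not define them further), and the token counts below are made-up illustrative values, not data from the paper. The function name `mann_whitney_u` is hypothetical, and the p-value uses a plain normal approximation without tie or continuity correction.

```python
import math


def mann_whitney_u(x, y):
    """One-sided Mann-Whitney U test (alternative: x tends to be larger than y).

    Returns (U, p), where p comes from a normal approximation with no tie
    or continuity correction. This is a simplification for illustration.
    """
    n1, n2 = len(x), len(y)
    # U counts the pairs (a, b) with a > b; a tie contributes one half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2
    std_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / std_u
    # Upper-tail probability of the standard normal distribution.
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return u, p


# Illustrative token counts only (assumed values, not from the paper).
g0_tokens = [210, 185, 240, 198, 225, 260]
g1_tokens = [120, 150, 135, 160, 142, 128]

u_stat, p_value = mann_whitney_u(g0_tokens, g1_tokens)
print(f"U = {u_stat}, one-sided p = {p_value:.4f}")
```

A small p-value here would support the reading that token counts in the first group tend to be larger. For a real analysis, a library routine such as `scipy.stats.mannwhitneyu` with `alternative="greater"` would be the more usual choice, since it handles ties and small samples more carefully.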
