What Makes Math Word Problems Challenging for LLMs?

KV Aditya Srivatsa; Ekaterina Kochmar

TL;DR

This work analyzes why math word problems (MWPs) in English pose difficulty for large language models (LLMs) by decomposing problems into linguistic, mathematical, and world-knowledge components. Using GSM8K, the authors collect solution attempts from multiple open-source LLMs and train feature-based classifiers to predict whether a given MWP will be solved correctly by a particular model, highlighting which features most influence difficulty. Key findings show that higher diversity and count of math operations, infrequent numeric tokens, longer and more complex phrasing, and required real-world knowledge all contribute to reduced solvability, with feature importance varying by model and dataset subset. The study demonstrates that relatively simple, interpretable classifiers can approach transformer-based baselines in predicting solvability and provides actionable insights for dataset design, prompt construction, and future research into the reasoning needs of LLMs for mathematical tasks.

Abstract

This paper investigates the question of what makes math word problems (MWPs) in English challenging for large language models (LLMs). We conduct an in-depth analysis of the key linguistic and mathematical characteristics of MWPs. In addition, we train feature-based classifiers to better understand the impact of each feature on the overall difficulty of MWPs for prominent LLMs and investigate whether this helps predict how well LLMs fare against specific categories of MWPs.
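
The feature-based view described above can be made concrete with a small sketch. The code below is an illustration only, not the authors' feature set or pipeline. It assumes GSM8K-style reference solutions, which embed calculator annotations of the form <<expression=result>>, and derives a few of the feature types the summary mentions: question length, number of numeric tokens, and the count and diversity of math operations. The function name `extract_features`, the feature names, and the sample problem are all hypothetical.

```python
import re

# GSM8K-style solutions embed calculator annotations such as <<12*8=96>>.
# Capture the expression to the left of the "=" sign.
CALC = re.compile(r"<<([^<>=]*)=[^<>]*>>")
NUMBER = re.compile(r"\d+(?:\.\d+)?")
OPERATORS = "+-*/"


def extract_features(question: str, solution: str) -> dict:
    """Return a few illustrative difficulty features for one problem.

    Hypothetical helper: a simplification, not the paper's feature extractor.
    A leading minus sign in an expression would be counted as an operation.
    """
    expressions = CALC.findall(solution)
    ops = [ch for expr in expressions for ch in expr if ch in OPERATORS]
    return {
        "question_tokens": len(question.split()),        # whitespace tokens
        "numeric_tokens": len(NUMBER.findall(question)), # numbers in the question
        "op_count": len(ops),                            # total operations used
        "op_diversity": len(set(ops)),                   # distinct operation types
    }


# Made-up problem in the GSM8K annotation style (not taken from the dataset).
question = ("A baker makes 12 trays of 8 rolls and sells 30 rolls. "
            "How many rolls are left?")
solution = ("The baker makes 12*8 = <<12*8=96>>96 rolls. "
            "After selling, 96-30 = <<96-30=66>>66 rolls are left.")

features = extract_features(question, solution)
print(features)
```

Feature dictionaries like this one could then be fed, together with a per-model solved/unsolved label, to any standard classifier; which classifier and which features work best is an empirical question that the paper itself addresses.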

Paper Structure (35 sections, 6 figures, 8 tables)

This paper contains 35 sections, 6 figures, 8 tables.

Introduction
Methodology
Data
Approach
LLMs
Features
Experiments
Solution Generation
Success Rate Prediction
Models
Data
Preprocessing & Optimization
Results
Success Rate Distribution
Classification Results
...and 20 more sections

Figures (6)

Figure 1: A response from Llama2-70B to a lengthy math problem that involves NLU challenges.
Figure 2: Number of questions from the GSM8K-test (a) always and (b) never answered correctly by each LLM. The rows in each figure correspond to individual LLMs, with the counts on the right denoting the total number of questions always (or never) answered correctly by each LLM. The counts at the bottom denote the number of questions in each subset of LLMs.
Figure 3: Results of the ablation studies across feature types (L: Linguistic, M: Mathematical, W: World Knowledge & NLU). Each bar represents the mean macro-F1 score over all three classifier models.
Figure 4: Solution attempt by Llama2-70B on the question from Figure 1, with the required real-world knowledge explicitly specified.
Figure 5: Spearman correlation matrix between features. All correlation values are marked with '*', '**', and '***' if their corresponding p-values are less than $0.05$, $0.01$, and $0.001$ respectively.
...and 1 more figures
