What Makes Math Word Problems Challenging for LLMs?
KV Aditya Srivatsa, Ekaterina Kochmar
TL;DR
This work analyzes why English math word problems (MWPs) are difficult for large language models (LLMs) by decomposing each problem into linguistic, mathematical, and world-knowledge components. Using GSM8K, the authors collect solution attempts from multiple open-source LLMs and train feature-based classifiers to predict whether a particular model will solve a given MWP correctly, which highlights the features that most influence difficulty. Key findings: a larger number and greater diversity of math operations, infrequent numeric tokens, longer and more complex phrasing, and a need for real-world knowledge all make a problem less likely to be solved, with feature importance varying by model and dataset subset. The study shows that relatively simple, interpretable classifiers can approach transformer-based baselines in predicting solvability, and it offers actionable insights for dataset design, prompt construction, and future research into the reasoning LLMs need for mathematical tasks.
Abstract
This paper investigates the question of what makes math word problems (MWPs) in English challenging for large language models (LLMs). We conduct an in-depth analysis of the key linguistic and mathematical characteristics of MWPs. In addition, we train feature-based classifiers to better understand the impact of each feature on the overall difficulty of MWPs for prominent LLMs and investigate whether this helps predict how well LLMs fare against specific categories of MWPs.
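To make the idea of a feature-based solvability classifier concrete, the following is a minimal sketch and not the authors' implementation. It assumes GSM8K-style solutions whose calculator annotations look like `<<3*12=36>>`. The four features, the function names (`extract_features`, `train_logistic`, `predict_proba`), and the choice of a plain logistic regression are all illustrative assumptions; the paper's actual feature set and classifiers may differ.

```python
import math
import re

# Assumption: solutions carry GSM8K-style calculator annotations, e.g. <<3*12=36>>.
ANNOTATION = re.compile(r"<<([^=<>]+)=[^<>]*>>")
NUMBER = re.compile(r"\d+(?:\.\d+)?")
OPERATORS = "+-*/"


def extract_features(question: str, solution: str) -> dict:
    """Compute a few illustrative features (hypothetical, not the paper's set)."""
    expressions = ANNOTATION.findall(solution)
    # Note: a leading minus sign in an expression is also counted as an operator.
    ops = [ch for expr in expressions for ch in expr if ch in OPERATORS]
    return {
        "question_words": len(question.split()),          # crude length proxy
        "numeric_tokens": len(NUMBER.findall(question)),  # numbers in the question
        "op_count": len(ops),                             # total math operations
        "op_diversity": len(set(ops)),                    # distinct operation types
    }


def predict_proba(w: list, b: float, x: list) -> float:
    """Probability that the problem is solved, under a logistic model."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X: list, y: list, lr: float = 0.1, epochs: int = 500):
    """Fit logistic regression by stochastic gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # gradient of log loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b
```

In this sketch, each problem would be turned into a feature vector and paired with a binary label recording whether a given LLM answered it correctly. Because the model is linear, the sign and size of each learned weight give a rough, interpretable indication of how that feature relates to solvability, which is the kind of analysis the abstract describes.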
