What Makes Math Word Problems Challenging for LLMs?

KV Aditya Srivatsa, Ekaterina Kochmar

TL;DR

This work analyzes why math word problems (MWPs) in English pose difficulty for large language models (LLMs) by decomposing problems into linguistic, mathematical, and world-knowledge components. Using GSM8K, the authors collect solution attempts from multiple open-source LLMs and train feature-based classifiers to predict whether a given MWP will be solved correctly by a particular model, highlighting which features most influence difficulty. Key findings show that higher diversity and count of math operations, infrequent numeric tokens, longer and more complex phrasing, and required real-world knowledge all contribute to reduced solvability, with feature importance varying by model and dataset subset. The study demonstrates that relatively simple, interpretable classifiers can approach transformer-based baselines in predicting solvability and provides actionable insights for dataset design, prompt construction, and future research into the reasoning needs of LLMs for mathematical tasks.

Abstract

This paper investigates the question of what makes math word problems (MWPs) in English challenging for large language models (LLMs). We conduct an in-depth analysis of the key linguistic and mathematical characteristics of MWPs. In addition, we train feature-based classifiers to better understand the impact of each feature on the overall difficulty of MWPs for prominent LLMs and investigate whether this helps predict how well LLMs fare against specific categories of MWPs.
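As a rough illustration of what a feature-based view of an MWP can look like, the sketch below derives three simple features from a problem and its gold solution: question length, number of math operations, and number of distinct operations. It relies on the GSM8K convention of annotating each computation in the solution as `<<expression=result>>`. The function name `extract_features`, the feature names, and the example problem are all hypothetical, and this is not the authors' actual feature set.

```python
import re

# GSM8K gold solutions annotate each computation as <<expression=result>>.
CALC_ANNOTATION = re.compile(r"<<([^>=]+)=[^>]*>>")
OPERATORS = "+-*/"


def extract_features(question: str, solution: str) -> dict:
    """Return a few illustrative difficulty features (hypothetical names)."""
    expressions = CALC_ANNOTATION.findall(solution)
    # Collect every arithmetic operator that appears in the annotated expressions.
    ops = [ch for expr in expressions for ch in expr if ch in OPERATORS]
    return {
        "question_word_count": len(question.split()),
        "num_operations": len(ops),
        "num_distinct_operations": len(set(ops)),
    }


# A made-up problem written in the GSM8K annotation style (not from the dataset).
example_question = (
    "A baker makes 12 loaves each day for 5 days and gives away 8. "
    "How many loaves are left?"
)
example_solution = (
    "The baker makes 12*5 = <<12*5=60>>60 loaves. "
    "After giving away 8, 60-8 = <<60-8=52>>52 loaves remain."
)

features = extract_features(example_question, example_solution)
print(features)
```

Feature dictionaries of this kind could then be fed to any standard classifier, with a binary label recording whether a given LLM solved the problem.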

Paper Structure (35 sections, 6 figures, 8 tables)

Figures (6)

  • Figure 1: A response from Llama2-70B to a lengthy math problem that involves NLU challenges.
  • Figure 2: Number of questions from the GSM8K-test (a) always and (b) never answered correctly by each LLM. The rows in each figure correspond to individual LLMs, with the counts on the right denoting the total number of questions always (or never) answered correctly by each LLM. The counts at the bottom denote the number of questions in each subset of LLMs.
  • Figure 3: Results of the ablation studies across feature types (L -- Linguistic, M -- Mathematical, W -- World Knowledge & NLU). Each bar represents the mean macro-F1 score over all three classifier models.
  • Figure 4: Solution attempt by Llama2-70B on the question from Figure 1, with the required real-world knowledge explicitly specified.
  • Figure 5: Spearman correlation matrix between features. Correlation values are marked with '*', '**', or '***' when the corresponding p-value is below $0.05$, $0.01$, or $0.001$, respectively.
  • ...and 1 more figure
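
The Figure 3 caption reports a mean macro-F1 score over three classifier models. As a minimal sketch of that metric, the code below computes macro-F1 (the unweighted mean of per-class F1 scores) and averages it over several sets of predictions. The labels and predictions are invented for illustration, and the three prediction lists stand in for hypothetical classifiers rather than the ones used in the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Invented data: 1 = "LLM solved the problem", 0 = "LLM failed".
y_true = [1, 1, 0, 0]
predictions_by_classifier = [
    [1, 1, 0, 0],  # hypothetical classifier A
    [1, 0, 0, 0],  # hypothetical classifier B
    [1, 1, 1, 0],  # hypothetical classifier C
]

scores = [macro_f1(y_true, preds) for preds in predictions_by_classifier]
mean_macro_f1 = sum(scores) / len(scores)
print(scores, mean_macro_f1)
```

Because macro-F1 weights both classes equally, it stays informative even when solved and unsolved problems are imbalanced.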