Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation

Ziling Cheng, Meng Cao, Leila Pishdad, Yanshuai Cao, Jackie Chi Kit Cheung

TL;DR

The study questions the reliance on final-answer accuracy as a measure of mathematical reasoning in large language models by disentangling two core skills: abstract formulation and arithmetic computation. Using GSM8K and SVAMP, it shows that models can often form abstractions without Chain-of-Thought (CoT) prompting, while computation remains the bottleneck for final answers; CoT mainly improves calculation rather than abstraction. Mechanistic interpretability methods reveal an abstract-then-compute mechanism operating within a single forward pass: abstractions form early, transfer across symbolic and numerical forms, and then guide computation. Cross-prompt patching demonstrates that these abstractions are causal, transferable, and composable, underscoring the need for disentangled evaluation to accurately assess model reasoning and guide future improvements. Overall, the work advocates a shift from sole reliance on final answers toward diagnostics that pinpoint where arithmetic errors, not reasoning gaps, limit performance.

Abstract

Final-answer-based metrics are commonly used for evaluating large language models (LLMs) on math word problems, often taken as proxies for reasoning ability. However, such metrics conflate two distinct sub-skills: abstract formulation (capturing mathematical relationships using expressions) and arithmetic computation (executing the calculations). Through a disentangled evaluation on GSM8K and SVAMP, we find that the final-answer accuracy of Llama-3 and Qwen2.5 (1B-32B) without CoT is overwhelmingly bottlenecked by the arithmetic computation step and not by the abstract formulation step. Contrary to the common belief, we show that CoT primarily aids in computation, with limited impact on abstract formulation. Mechanistically, we show that these two skills are composed conjunctively even in a single forward pass without any reasoning steps via an abstract-then-compute mechanism: models first capture problem abstractions, then handle computation. Causal patching confirms these abstractions are present, transferable, composable, and precede computation. These behavioural and mechanistic findings highlight the need for disentangled evaluation to accurately assess LLM reasoning and to guide future improvements.
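The distinction between the two sub-skills can be made concrete with a small scoring sketch. This is not the paper's evaluation code: the function names, the record fields, and the toy examples below are all hypothetical, and it assumes that an abstraction counts as correct when the model's predicted expression, evaluated by an exact external calculator, matches the gold answer.

```python
import ast
import operator

# Supported binary operators for the exact external calculator.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expr: str) -> float:
    """Exactly evaluate an arithmetic expression using + - * / and parentheses."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return float(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr, mode="eval"))


def disentangled_scores(examples, tol=1e-6):
    """Score abstraction and final-answer accuracy separately.

    Each example is a dict with hypothetical keys:
      gold_answer          - reference numeric answer
      predicted_expression - expression the model wrote (its abstraction)
      predicted_answer     - number the model produced directly, without CoT
    """
    abstraction_ok = 0
    final_ok = 0
    for ex in examples:
        try:
            value = evaluate(ex["predicted_expression"])
        except (ValueError, SyntaxError, ZeroDivisionError):
            value = None  # malformed abstraction counts as incorrect
        abstraction_ok += value is not None and abs(value - ex["gold_answer"]) < tol
        final_ok += abs(ex["predicted_answer"] - ex["gold_answer"]) < tol
    n = len(examples)
    return {"abstraction": abstraction_ok / n, "final_answer": final_ok / n}


# Illustrative toy records only; these are not results from the paper.
TOY_EXAMPLES = [
    {"gold_answer": 8, "predicted_expression": "5 + 3", "predicted_answer": 8},
    # Correct abstraction, wrong arithmetic: 23 * 51 is 1173, not 1163.
    {"gold_answer": 1173, "predicted_expression": "23 * 51", "predicted_answer": 1163},
    # Wrong abstraction (should be 5 - 3), so the final answer is also wrong.
    {"gold_answer": 2, "predicted_expression": "5 + 3", "predicted_answer": 8},
]

if __name__ == "__main__":
    print(disentangled_scores(TOY_EXAMPLES))
```

On these toy records the abstraction score is higher than the final-answer score, because the second record formulates the problem correctly but miscomputes the product. A final-answer metric alone would not distinguish that record from the third, where the formulation itself is wrong.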

Paper Structure

This paper contains 38 sections, 1 equation, 111 figures, 7 tables, 1 algorithm.

Figures (111)

  • Figure 1: Left (Disentangled evaluation framework): Final-answer accuracy obscures reasoning ability due to conflating abstract formulation and arithmetic computation. Right (Abstract-then-Compute Mechanism in Llama-3 8B): (a) Residual stream at the last token position shows that models first capture problem abstraction (L13-14), followed by computation (L18). (b) Same as (a), but one critical layer output is patched with a different symbolic abstraction (e.g., $x-y$), causally changing the computation from $5 + 3 = 8$ to $5 - 3 = 2$.
  • Figure 2: Distribution of problem characteristics by number of reasoning steps (GSM8K) and presence of distractors (SVAMP).
  • Figure 3: Zero-shot model performance without CoT on GSM8K. (i) Models are much better at abstraction (Symbolic and Numerical) than at actually computing the expressions (Arithmetic Computation). (ii) Final-answer accuracy in the Original setting may give a misleading picture of models' reasoning ability, possibly due to arithmetic limitations.
  • Figure 4: Overview of interpretability methods probing the abstract-then-compute mechanism in simple math problems, focusing on hidden states at the last token position across layers.
  • Figure 5: Attention
  • ...and 106 more figures
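
The cross-prompt patching idea in the Figure 1 caption can be illustrated with a deliberately simplified toy. The sketch below is not a transformer and not the paper's method: the three "layers", the state dictionary, and the `run` helper are hypothetical stand-ins, chosen only to show how overwriting an intermediate abstraction with one cached from a different prompt changes the downstream computation.

```python
# Toy stand-in for a forward pass: each "layer" reads and extends a state dict.
# Real patching operates on residual-stream vectors; this only mirrors the logic.


def read_operands(state):
    out = dict(state)
    out["operands"] = state["prompt"]["operands"]
    return out


def form_abstraction(state):
    # Hypothetical "abstraction" slot: the relation between the operands.
    out = dict(state)
    out["abstraction"] = state["prompt"]["relation"]
    return out


def compute(state):
    x, y = state["operands"]
    out = dict(state)
    out["answer"] = x + y if state["abstraction"] == "x+y" else x - y
    return out


LAYERS = [read_operands, form_abstraction, compute]


def run(prompt, patch_layer=None, patch_value=None, cache=None):
    """Run the toy model, optionally caching states or patching one layer's output."""
    state = {"prompt": prompt}
    for i, layer in enumerate(LAYERS):
        state = layer(state)
        if cache is not None:
            cache[i] = dict(state)  # store this layer's output for later reuse
        if i == patch_layer:
            state["abstraction"] = patch_value  # overwrite with the donor abstraction
    return state["answer"]


clean_prompt = {"operands": (5, 3), "relation": "x+y"}
donor_prompt = {"operands": (7, 2), "relation": "x-y"}

# 1) Cache the donor run's intermediate states.
donor_cache = {}
run(donor_prompt, cache=donor_cache)

# 2) Run the clean prompt, patching in the donor's abstraction after layer 1.
baseline = run(clean_prompt)
patched = run(clean_prompt, patch_layer=1, patch_value=donor_cache[1]["abstraction"])

if __name__ == "__main__":
    print("baseline:", baseline)  # 5 + 3
    print("patched:", patched)    # 5 - 3
```

The operands of the clean prompt are left untouched; only the abstraction slot is replaced, so the patched run computes with the clean prompt's numbers under the donor's relation, matching the $5 + 3 = 8$ to $5 - 3 = 2$ change described in the caption.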