Catch Me If You Can: How Smaller Reasoning Models Pretend to Reason with Mathematical Fidelity

Subramanyam Sahoo, Vinija Jain, Saanidhya Vats, Siddharth Mohapatra, Rui Min, Aman Chadha, Divya Chaudhary

TL;DR

Answer accuracy alone can mask flawed mathematical reasoning in language models. The paper proposes a unified diagnostic framework that evaluates forward-backward consistency, transitivity coverage, counterfactual sensitivity, and perturbation robustness, and demonstrates it on Qwen3-0.6B, a 600M-parameter model, using the MenatQA dataset. The model reaches reasonable answer accuracy (70%+) but shows poor backward consistency (15%) and limited transitivity coverage (32.2%), which suggests reliance on pattern matching rather than genuine computation. The framework is model-agnostic, and the authors release their evaluation protocols to push evaluation toward verifiable mathematical reasoning rather than superficial plausibility.

Abstract

Current evaluation of mathematical reasoning in language models relies primarily on answer accuracy, potentially masking fundamental failures in logical computation. We introduce a diagnostic framework that distinguishes genuine mathematical reasoning from superficial pattern matching through four complementary axes: forward-backward consistency, transitivity coverage, counterfactual sensitivity, and perturbation robustness. Through a case study applying this framework to Qwen3-0.6B on the MenatQA dataset, we reveal a striking disconnect between surface performance and reasoning fidelity. While the model achieves reasonable answer accuracy (70%+), it demonstrates poor backward consistency (15%), limited transitivity coverage (32.2%), and brittle sensitivity to perturbations. Our diagnostics expose reasoning failures invisible to traditional accuracy metrics, suggesting that this small model relies heavily on pattern matching rather than genuine logical computation. While our empirical findings are based on a single 600M-parameter model, the diagnostic framework itself is model-agnostic and generalizable. We release our evaluation protocols to enable the research community to assess reasoning fidelity across different model scales and architectures, moving beyond surface-level accuracy toward verifiable mathematical reasoning.
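
To make the gap between answer accuracy and the diagnostic axes concrete, here is a minimal Python sketch of how two of them might be scored. It is an illustration under stated assumptions, not the authors' released protocol: the record fields `forward_ok` and `backward_ok`, the function names, and the exact metric definitions (backward consistency measured only over items answered correctly, transitivity coverage measured as the share of implied relations the model also affirms) are all hypothetical.

```python
from itertools import product


def answer_accuracy(records):
    """Fraction of items whose forward answer matches the gold answer."""
    return sum(r["forward_ok"] for r in records) / len(records)


def backward_consistency(records):
    """Among items answered correctly, the fraction whose backward check passes.

    Assumed definition: a backward check asks the model to recover an input
    from its own answer; the paper's exact formulation may differ.
    """
    solved = [r for r in records if r["forward_ok"]]
    if not solved:
        return 0.0
    return sum(r["backward_ok"] for r in solved) / len(solved)


def transitivity_coverage(affirmed):
    """Share of implied relations (a, c) that the model also affirms.

    `affirmed` is a set of ordered pairs (a, b) the model asserts, for
    example "a happens before b". A pair (a, c) is implied whenever both
    (a, b) and (b, c) are affirmed. Only one-step implications are counted.
    """
    implied = {
        (a, d)
        for (a, b), (c, d) in product(affirmed, repeat=2)
        if b == c and a != d
    }
    if not implied:
        return 1.0
    return len(implied & affirmed) / len(implied)


# Hypothetical toy data, not taken from the paper.
records = [
    {"forward_ok": True, "backward_ok": False},
    {"forward_ok": True, "backward_ok": True},
    {"forward_ok": False, "backward_ok": False},
    {"forward_ok": True, "backward_ok": False},
]
affirmed = {("A", "B"), ("B", "C"), ("C", "D"), ("A", "C")}

print(answer_accuracy(records))
print(backward_consistency(records))
print(transitivity_coverage(affirmed))
```

On this toy data, three of four answers are correct, yet only one of the three survives the backward check and only one of three implied relations is affirmed. That is the kind of disconnect the abstract describes: a model can look acceptable on accuracy while scoring poorly on consistency diagnostics.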

Paper Structure

This paper contains 41 sections, 38 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Hop-Distribution
  • Figure 2: Faithfulness vs Plausibility Analysis
  • Figure 3: Robustness Analysis
  • Figure 4: Transitivity Analysis
  • Figure 5: Counterfactual Reasoning Analysis