When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning

Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary

TL;DR

Benchmark accuracy can mask computational unreliability, so evaluation should measure reasoning stability rather than rely on single-sample metrics.

Abstract

Mathematical reasoning models are widely deployed in education, automated tutoring, and decision support systems despite exhibiting fundamental computational instabilities. We demonstrate that state-of-the-art models (Qwen2.5-Math-7B) achieve 61% accuracy through a mixture of reliable and unreliable reasoning pathways: 18.4% of correct predictions employ stable, faithful reasoning while 81.6% emerge through computationally inconsistent pathways. Additionally, 8.8% of all predictions are silent failures -- confident yet incorrect outputs. Through comprehensive analysis using novel faithfulness metrics, we reveal: (1) reasoning quality shows weak negative correlation with correctness (r=-0.21, p=0.002), reflecting a binary classification threshold artifact rather than a monotonic inverse relationship; (2) scaling from 1.5B to 7B parameters (4.7x increase) provides zero accuracy benefit on our evaluated subset (6% of GSM8K), requiring validation on the complete benchmark; and (3) latent reasoning employs diverse computational strategies, with ~20% sharing CoT-like patterns. These findings highlight that benchmark accuracy can mask computational unreliability, demanding evaluation reforms measuring stability beyond single-sample metrics.
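The percentages above use different denominators: the 18.4% and 81.6% figures are shares of correct predictions, while the 8.8% figure is a share of all predictions. The following is a minimal sketch that puts them on a common footing, assuming that the 61% accuracy covers the whole evaluated subset and that every silent failure counts as an incorrect prediction. The function name and dictionary keys are illustrative and do not come from the paper.

```python
def decompose_outcomes(accuracy, stable_share_of_correct, silent_failure_rate):
    """Express each outcome category as a fraction of ALL predictions.

    accuracy:                 fraction of all predictions that are correct
    stable_share_of_correct:  fraction of CORRECT predictions with stable reasoning
    silent_failure_rate:      fraction of ALL predictions that are confident but wrong
    """
    incorrect = 1.0 - accuracy
    return {
        "stable_correct": accuracy * stable_share_of_correct,
        "unstable_correct": accuracy * (1.0 - stable_share_of_correct),
        "silent_failure": silent_failure_rate,
        # Assumption: silent failures are a subset of the incorrect predictions.
        "other_incorrect": incorrect - silent_failure_rate,
        "silent_share_of_incorrect": silent_failure_rate / incorrect,
    }


# Values reported in the abstract for Qwen2.5-Math-7B.
breakdown = decompose_outcomes(
    accuracy=0.61, stable_share_of_correct=0.184, silent_failure_rate=0.088
)

for name, value in breakdown.items():
    print(f"{name}: {value:.1%}")
```

Under these assumptions, roughly 11.2% of all predictions are both correct and stably reasoned, about 49.8% are correct via inconsistent pathways, and about 22.6% of the incorrect predictions are silent failures.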

Paper Structure

This paper contains 44 sections, 63 equations, 5 figures, 6 tables, 2 algorithms.

Figures (5)

  • Figure 1: Main results. (a) Faithfulness components with 0.65 threshold. (b) Implicit vs. explicit CoT depth distributions. (c) Layer-wise activation magnitude with key layers marked. (d) Depth--accuracy relationship. (e) Failure mode distribution. (f) Metric--correctness correlations.
  • Figure 2: Layer causal importance via noise intervention ($N=50$, $\sigma=0.1$). Red: positive importance; blue: negative. Middle layers 6--9 show highest causal importance (mean = 0.011).
  • Figure 3: Detailed analysis. (a) Fidelity vs. correctness with jittered outcomes and trend line. (b) Reasoning depth distribution. (c) Accuracy by difficulty and fidelity level.
  • Figure 4: Supplementary analyses. (a) Thinking token usage by difficulty. (b) Trajectory similarity distributions with 0.7 support threshold. (c) Ablation study results. (d) Information bottleneck analysis with compression layers marked.
  • Figure 5: Comparison of Qwen2.5-Math-7B vs. 1.5B. (a) Identical accuracy. (b) Reasoning depth comparison. (c) Multi-metric comparison (normalized).
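The Figure 2 caption refers to a noise intervention with $N=50$ trials and $\sigma=0.1$. The sketch below illustrates the general idea on a toy network; it is not the authors' implementation. It assumes that importance is measured as the drop in a quality score when Gaussian noise is added to one layer's activations, averaged over trials. The toy tanh network, the negative-MSE score, and all function names are hypothetical stand-ins for the real model and metric.

```python
import math
import random


def forward(weights, x, noise_layer=None, sigma=0.0, rng=None):
    """Run a toy tanh MLP; optionally add Gaussian noise after one layer."""
    h = list(x)
    for i, w in enumerate(weights):
        h = [math.tanh(sum(wij * hj for wij, hj in zip(row, h))) for row in w]
        if i == noise_layer:
            h = [v + rng.gauss(0.0, sigma) for v in h]
    return h


def score(output, target):
    """Hypothetical quality score: negative mean squared error."""
    return -sum((o - t) ** 2 for o, t in zip(output, target)) / len(target)


def layer_importance(weights, x, target, n_trials=50, sigma=0.1, seed=0):
    """Importance of a layer = clean score minus mean score under noise.

    Positive values mean that perturbing the layer hurts the output.
    """
    rng = random.Random(seed)
    clean = score(forward(weights, x), target)
    importance = []
    for layer in range(len(weights)):
        noisy = [
            score(forward(weights, x, layer, sigma, rng), target)
            for _ in range(n_trials)
        ]
        importance.append(clean - sum(noisy) / n_trials)
    return importance


# Toy setup: 4 layers of width 8 with random weights (assumed, not from the paper).
init = random.Random(42)
dim, n_layers = 8, 4
weights = [
    [[init.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(dim)]
    for _ in range(n_layers)
]
x = [init.gauss(0.0, 1.0) for _ in range(dim)]
target = forward(weights, x)  # use the clean output as the reference

importance = layer_importance(weights, x, target)
for layer, value in enumerate(importance):
    print(f"layer {layer}: importance {value:.4f}")
```

Because this toy uses the clean output as the target, every importance value is non-negative. With a task-level score such as answer correctness, noise can occasionally help, which would correspond to the negative (blue) values the caption describes.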