Beyond Scalars: Evaluating and Understanding LLM Reasoning via Geometric Progress and Stability

Xinyan Jiang, Ninghao Liu, Di Wang, Lijie Hu

TL;DR

TRACED is a framework that assesses reasoning quality through theoretically grounded geometric kinematics. It decomposes reasoning traces into Progress (displacement) and Stability (curvature), revealing a distinct topological divergence between correct and hallucinated reasoning.

Abstract

Evaluating LLM reliability via scalar probabilities often fails to capture the structural dynamics of reasoning. We introduce TRACED, a framework that assesses reasoning quality through theoretically grounded geometric kinematics. By decomposing reasoning traces into Progress (displacement) and Stability (curvature), we reveal a distinct topological divergence: correct reasoning manifests as high-progress, stable trajectories, whereas hallucinations are characterized by low-progress, unstable patterns (stalled displacement with high curvature fluctuations). Leveraging these signatures, our probabilistic framework achieves competitive performance and superior robustness across diverse benchmarks. Crucially, TRACED bridges geometry and cognition by mapping high curvature to "Hesitation Loops" and displacement to "Certainty Accumulation", offering a physical lens to decode the internal dynamics of machine thought.
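To make the two quantities concrete, here is a minimal sketch of how displacement and a curvature proxy could be computed for a trajectory of points. It is an illustration only, not the paper's implementation: the function names are hypothetical, and the mean turning angle between consecutive steps is an assumed stand-in for the curvature measure, whose exact definition is not given on this page.

```python
import math

def displacement(traj):
    """Euclidean distance between the last and the first point of a trajectory."""
    return math.dist(traj[-1], traj[0])

def mean_turning_angle(traj):
    """Average angle (radians) between consecutive step vectors.

    Assumed curvature proxy: 0 for a straight path, pi for a path that
    reverses direction at every step.
    """
    steps = [[b - a for a, b in zip(p, q)] for p, q in zip(traj, traj[1:])]
    angles = []
    for u, v in zip(steps, steps[1:]):
        norm_u = math.hypot(*u)
        norm_v = math.hypot(*v)
        if norm_u == 0.0 or norm_v == 0.0:
            continue  # a zero-length step has no direction
        cos = sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
        cos = max(-1.0, min(1.0, cos))  # guard against rounding error
        angles.append(math.acos(cos))
    return sum(angles) / len(angles) if angles else 0.0

# Toy 2-D trajectories (hypothetical data, not model hidden states).
straight = [(float(t), 0.0) for t in range(6)]
zigzag = [(0.0, 0.0), (1.0, 0.0), (0.0, 0.0), (1.0, 0.0), (0.0, 0.0), (1.0, 0.0)]

for name, traj in [("straight", straight), ("zigzag", zigzag)]:
    print(name, displacement(traj), mean_turning_angle(traj))
```

The straight path corresponds to the high-progress, stable pattern described above, while the zigzag path takes the same number of steps but ends close to where it started, with a large turning angle at every step.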

Paper Structure (65 sections, 3 theorems, 22 equations, 12 figures, 10 tables, 1 algorithm)


Key Result

Theorem 5.4

Under the paper's high-SNR assumption, as $\sigma \to 0$, the reasoning trajectory exhibits linear displacement growth: the expected displacement scales linearly with the time step $T$, and the local curvature vanishes. In the context of empirical scaling laws, this implies a structurally directed trajectory in which the log-log slope of displacement versus time is approximately 1.
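One way to see why the slope approaches 1 is a toy drift-plus-noise model. The recursion below is an assumption made for illustration (the paper's Definition 5.1 is not reproduced on this page), and the symbols $\mu$ and $\epsilon_t$ are hypothetical.

```latex
% Assumed toy dynamics (not the paper's Definition 5.1): constant drift plus noise.
z_{t+1} = z_t + \mu + \sigma \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I)

% Unrolling T steps:
z_T - z_0 = T\mu + \sigma \sum_{t=0}^{T-1} \epsilon_t

% As \sigma \to 0 the noise term vanishes, so
\|z_T - z_0\|_2 \to T \, \|\mu\|_2,
\qquad
\log \|z_T - z_0\|_2 \to \log T + \log \|\mu\|_2
```

In this sketch the log of the displacement is linear in $\log T$ with slope 1, and every step points along $\mu$, so the angle between consecutive steps tends to zero.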

Figures (12)

  • Figure 1: Topological Divergence of Reasoning Quality. Joint distribution of cumulative displacement ($M$) and curvature ($K$) across Structured and Open-Ended domains. The visualization confirms a consistent separation: correct reasoning traces (blue) exhibit a high-displacement, low-curvature pattern, while incorrect chains (red) are characterized by low-displacement stagnation and high-curvature oscillations.
  • Figure 2: Universality and Generalization Analysis. (a) Universal Signature: A single global fit model derived from aggregated data achieves competitive AUPR across diverse tasks, supporting the existence of a task-agnostic geometric signature. (b) Cross-Domain Adaptation: Dumbbell plot comparing Direct Zero-shot Transfer (blue circles), Aligned Transfer (purple squares), and Supervised In-domain Upper Bound (red stars). Results confirm that the geometric alignment significantly bridges the performance gap caused by distribution shifts.
  • Figure 3: Robustness and Efficiency. (Left) Class Imbalance: TRACED maintains discriminative stability against distributional shifts, specifically where the prior $P(y_n=1) \in [0.3, 0.7]$. (Right) Data Efficiency: The method achieves rapid geometric convergence, reaching a stability plateau with merely $N \approx 400$ reference samples.
  • Figure 4: Sensitivity to Subspace Dimension $k$. AUROC evaluation across four models ($k \in [2, 10]$) shows performance improves and stabilizes at $k=8$. Additional metrics (AUPR, FPR@95) are detailed in the appendix on sensitivity to $k$.
  • Figure 5: Kinematic Scaling Laws of Reasoning. Log-log plot of Net Displacement $D(T) = \|z_T - z_0\|_2$ vs. reasoning length $T$ across six domains. Blue (Correct): Exhibits approximately linear scaling ($\text{slope} \approx 0.82$), characteristic of directed evolution ($D \propto T$) where computation yields direct semantic progress. Red (Incorrect): Follows sub-linear scaling ($\text{slope} \approx 0.53$), resembling a random walk ($D \propto \sqrt{T}$) and indicating progress stagnation. Shaded regions denote standard deviation.
  • ...and 7 more figures
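The scaling contrast in Figure 5 can be reproduced qualitatively on synthetic data. The sketch below is an assumption-laden toy, not the paper's experiment: trajectories are simulated as a per-step drift plus Gaussian noise, the names `simulate_displacement`, `loglog_slope`, and `mean_disp` are hypothetical, and the dimension, noise levels, and lengths are arbitrary choices.

```python
import math
import random

def simulate_displacement(T, drift, sigma, rng):
    """Net displacement ||z_T - z_0|| of a walk with per-step drift plus Gaussian noise."""
    z = [0.0] * len(drift)
    for _ in range(T):
        for i, d in enumerate(drift):
            z[i] += d + sigma * rng.gauss(0.0, 1.0)
    return math.hypot(*z)

def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

rng = random.Random(0)       # fixed seed for repeatability
dim = 16                     # assumed embedding dimension for the toy
lengths = [8, 16, 32, 64, 128, 256]
runs = 50                    # trajectories averaged per length

def mean_disp(T, drift, sigma):
    return sum(simulate_displacement(T, drift, sigma, rng) for _ in range(runs)) / runs

directed_drift = [1.0] + [0.0] * (dim - 1)   # strong drift, weak noise
no_drift = [0.0] * dim                       # pure noise

directed = [mean_disp(T, directed_drift, 0.05) for T in lengths]
diffusive = [mean_disp(T, no_drift, 1.0) for T in lengths]

slope_directed = loglog_slope(lengths, directed)
slope_diffusive = loglog_slope(lengths, diffusive)
print(f"directed slope:  {slope_directed:.2f}")
print(f"diffusive slope: {slope_diffusive:.2f}")
```

In this toy the drift-dominated walk gives a slope close to 1 and the noise-only walk gives a slope close to 0.5, which are the two idealized regimes the caption compares against the empirical slopes.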

Theorems & Definitions (9)

  • Definition 5.1: Stochastic Reasoning Dynamics
  • Definition 5.2: Net Displacement
  • Theorem 5.4: Linear Displacement Scaling and Minimal Curvature
  • proof
  • Remark 5.5: Directedness of Valid Reasoning
  • Lemma 5.7: High-Dimensional Orthogonality (Vershynin, 2018)
  • Theorem 5.8: Sub-linear Displacement Scaling and Maximal Curvature
  • proof
  • Remark 5.9: Thinking Duration vs. Reasoning Progress
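The high-dimensional orthogonality named in Lemma 5.7 can be illustrated numerically: independent random directions become nearly orthogonal as the dimension grows. The sketch below uses standard Gaussian vectors as an assumed model of random steps; the function name and the chosen dimensions are hypothetical, and how the paper applies the lemma is not shown on this page.

```python
import math
import random

def mean_abs_cosine(dim, pairs, rng):
    """Average |cosine similarity| over independent pairs of Gaussian vectors."""
    total = 0.0
    for _ in range(pairs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dot = sum(a * b for a, b in zip(u, v))
        total += abs(dot) / (math.hypot(*u) * math.hypot(*v))
    return total / pairs

rng = random.Random(0)  # fixed seed for repeatability
low_dim = mean_abs_cosine(4, 200, rng)
high_dim = mean_abs_cosine(1024, 200, rng)
print(f"mean |cos| in 4 dimensions:    {low_dim:.3f}")
print(f"mean |cos| in 1024 dimensions: {high_dim:.3f}")
```

When consecutive steps are nearly orthogonal, their squared lengths add, so the net displacement after $T$ steps grows like $\sqrt{T}$ rather than $T$, which matches the random-walk scaling described for incorrect traces.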