Beyond Scalars: Evaluating and Understanding LLM Reasoning via Geometric Progress and Stability

Xinyan Jiang; Ninghao Liu; Di Wang; Lijie Hu

TL;DR

TRACED is a framework that assesses reasoning quality through theoretically grounded geometric kinematics. It decomposes reasoning traces into Progress (displacement) and Stability (curvature), revealing a distinct topological divergence between correct reasoning and hallucination.

Abstract

Evaluating LLM reliability via scalar probabilities often fails to capture the structural dynamics of reasoning. We introduce TRACED, a framework that assesses reasoning quality through theoretically grounded geometric kinematics. By decomposing reasoning traces into Progress (displacement) and Stability (curvature), we reveal a distinct topological divergence: correct reasoning manifests as high-progress, stable trajectories, whereas hallucinations are characterized by low-progress, unstable patterns (stalled displacement with high curvature fluctuations). Leveraging these signatures, our probabilistic framework achieves competitive performance and superior robustness across diverse benchmarks. Crucially, TRACED bridges geometry and cognition by mapping high curvature to "Hesitation Loops" and displacement to "Certainty Accumulation", offering a physical lens to decode the internal dynamics of machine thought.
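To make the Progress/Stability decomposition concrete, the following is a minimal sketch, not the authors' implementation. It assumes a reasoning trace is available as a sequence of hidden-state vectors, takes Progress to be the net Euclidean displacement between the first and last state, and takes Stability to be the mean turning angle between successive steps as a discrete stand-in for curvature. The function name `progress_and_stability` and both definitions are illustrative assumptions; the paper's exact estimators may differ.

```python
import numpy as np

def progress_and_stability(traj):
    """Return (net displacement, mean turning angle) for a trajectory.

    traj: array-like of shape (T+1, d), one latent vector per reasoning step.
    Both quantities are illustrative proxies, not the paper's estimators.
    """
    traj = np.asarray(traj, dtype=float)
    steps = np.diff(traj, axis=0)  # step vectors, shape (T, d)

    # Progress proxy: straight-line distance from the first to the last state.
    displacement = float(np.linalg.norm(traj[-1] - traj[0]))

    # Stability proxy: angle between each pair of successive step vectors.
    norms = np.linalg.norm(steps, axis=1)
    denom = np.maximum(norms[:-1] * norms[1:], 1e-12)  # avoid division by zero
    cos = np.sum(steps[:-1] * steps[1:], axis=1) / denom
    angles = np.arccos(np.clip(cos, -1.0, 1.0))
    curvature = float(angles.mean()) if angles.size else 0.0

    return displacement, curvature
```

Under these assumed definitions, a trajectory that moves in a straight line has large displacement and zero turning angle, while one that bounces back and forth between two points has zero net displacement and a turning angle of $\pi$, mirroring the high-progress/stable versus low-progress/unstable contrast described in the abstract.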

Paper Structure (65 sections, 3 theorems, 22 equations, 12 figures, 10 tables, 1 algorithm)

Introduction
Preliminaries
Reasoning as a Trajectory in Latent Space
The Execution Manifold and Semantic Geometry
Method: TRACED
Constructing the Reasoning Quality Space
Geometric Signatures of Reasoning Quality
Bayesian Assessment of Reasoning Quality
Experiments
Experimental Setup
Main Results
Robustness and Efficiency
Component Ablation and Hyperparameter Sensitivity
Kinematic Scaling Laws of Reasoning
Geometric Differences Across Domains
...and 50 more sections

Key Result

Theorem 5.4

Under the high-SNR assumption (assump:high_snr), as $\sigma \to 0$, the reasoning trajectory exhibits linear displacement growth: the expected displacement scales linearly with the time step $T$, and the local curvature vanishes. In the context of empirical scaling laws, this implies a structurally directed trajectory where the log-log slope of displacement versus time is approximately 1.
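The scaling claim can be illustrated with a short worked derivation. This is a sketch under an assumed drift-plus-noise model, not the paper's Definition 5.1, whose exact form is not reproduced here: the drift vector $\mu$, the isotropic Gaussian noise $\epsilon_t$, and the latent dimension $d$ are all illustrative assumptions.

```latex
% Assumed dynamics (illustrative): constant drift plus isotropic noise
z_{t+1} = z_t + \mu + \sigma\,\epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I_d) \text{ i.i.d.}

% Net displacement after T steps
z_T - z_0 = T\mu + \sigma \sum_{t=0}^{T-1} \epsilon_t

% Second moment: the cross term vanishes because E[\epsilon_t] = 0
\mathbb{E}\,\|z_T - z_0\|_2^2 = T^2 \|\mu\|_2^2 + \sigma^2 d\, T

% High-SNR limit (\sigma \to 0): linear growth, log-log slope 1
\|z_T - z_0\|_2 \to T\,\|\mu\|_2, \qquad \frac{d \log \|z_T - z_0\|_2}{d \log T} \to 1

% No-drift case (\mu = 0): diffusive growth, log-log slope 1/2
\sqrt{\mathbb{E}\,\|z_T - z_0\|_2^2} = \sigma \sqrt{d\,T}
```

In this toy model the drift term dominates whenever $T\|\mu\|_2^2 \gg \sigma^2 d$, which is one way to read a high signal-to-noise condition; the no-drift case gives the $\sqrt{T}$ scaling of a random walk.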

Figures (12)

Figure 1: Topological Divergence of Reasoning Quality. Joint distribution of cumulative displacement ($M$) and curvature ($K$) across Structured and Open-Ended domains. The visualization confirms a consistent separation: correct reasoning traces (blue) exhibit a high-displacement, low-curvature pattern, while incorrect chains (red) are characterized by low-displacement stagnation and high-curvature oscillations.
Figure 2: Universality and Generalization Analysis. (a) Universal Signature: A single global fit model derived from aggregated data achieves competitive AUPR across diverse tasks, supporting the existence of a task-agnostic geometric signature. (b) Cross-Domain Adaptation: Dumbbell plot comparing Direct Zero-shot Transfer (blue circles), Aligned Transfer (purple squares), and Supervised In-domain Upper Bound (red stars). Results confirm that the geometric alignment significantly bridges the performance gap caused by distribution shifts.
Figure 3: Robustness and Efficiency. (Left) Class Imbalance: TRACED maintains discriminative stability against distributional shifts, specifically where the prior $P(y_n=1) \in [0.3, 0.7]$. (Right) Data Efficiency: The method achieves rapid geometric convergence, reaching a stability plateau with merely $N \approx 400$ reference samples.
Figure 4: Sensitivity to Subspace Dimension $k$. AUROC evaluation across four models ($k \in [2, 10]$) shows performance improves and stabilizes at $k=8$. Additional metrics (AUPR, FPR@95) are detailed in the appendix (app:sensitivity_k).
Figure 5: Kinematic Scaling Laws of Reasoning. Log-log plot of Net Displacement $D(T) = \|z_T - z_0\|_2$ vs. reasoning length across six domains. Blue (Correct): Exhibits near-linear scaling (slope $\approx 0.82$), characteristic of directed evolution ($D \propto T$) where computation yields direct semantic progress. Red (Incorrect): Follows sub-linear scaling (slope $\approx 0.53$), resembling a random walk ($D \propto \sqrt{T}$) and indicating progress stagnation. Shaded regions denote standard deviation.
...and 7 more figures
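The log-log slopes quoted in the Figure 5 caption can be understood through a small simulation. The sketch below is an idealized toy, not the paper's experiment: it assumes a drift-plus-Gaussian-noise trajectory model, and the names `simulate_displacement` and `loglog_slope`, the latent dimension `d`, and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def simulate_displacement(T, drift, sigma, d=64, trials=200, seed=0):
    """Mean net displacement ||z_T - z_0||_2 under an assumed toy model:
    each step adds a constant drift along one axis plus isotropic noise."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(d)
    mu[0] = drift
    noise = rng.normal(0.0, sigma, size=(trials, T, d))
    end = T * mu + noise.sum(axis=1)  # z_T - z_0 for every trial
    return float(np.linalg.norm(end, axis=1).mean())

def loglog_slope(lengths, displacements):
    """Least-squares slope of log(displacement) against log(length)."""
    slope, _ = np.polyfit(np.log(lengths), np.log(displacements), 1)
    return float(slope)

lengths = np.array([8, 16, 32, 64, 128, 256])

# Directed regime: strong drift, weak noise (high signal-to-noise).
directed = [simulate_displacement(T, drift=1.0, sigma=0.05) for T in lengths]

# Diffusive regime: no drift, pure noise (random walk).
diffusive = [simulate_displacement(T, drift=0.0, sigma=1.0) for T in lengths]

slope_directed = loglog_slope(lengths, directed)
slope_diffusive = loglog_slope(lengths, diffusive)
print(f"directed slope:  {slope_directed:.2f}")
print(f"diffusive slope: {slope_diffusive:.2f}")
```

In this idealized setting the fitted slopes should land close to 1 for the directed regime and close to 0.5 for the diffusive regime. The empirical values reported in the caption (about 0.82 and 0.53) sit between these two idealized extremes for correct traces and near the random-walk value for incorrect ones.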

Theorems & Definitions (9)

Definition 5.1: Stochastic Reasoning Dynamics
Definition 5.2: Net Displacement
Theorem 5.4: Linear Displacement Scaling and Minimal Curvature
proof
Remark 5.5: Directedness of Valid Reasoning
Lemma 5.7: High-Dimensional Orthogonality (Vershynin, 2018)
Theorem 5.8: Sub-linear Displacement Scaling and Maximal Curvature
proof
Remark 5.9: Thinking Duration vs. Reasoning Progress
