Effective Reasoning Chains Reduce Intrinsic Dimensionality

Archiki Prasad, Mandar Joshi, Kenton Lee, Mohit Bansal, Peter Shaw

TL;DR

This work introduces intrinsic dimensionality as a quantitative lens for understanding why chain-of-thought reasoning improves generalization. By holding the model fixed and varying the task formulation across diverse reasoning strategies, the authors show that more effective reasoning reduces the minimum trainable parameter count needed to reach a performance threshold, and that this reduction strongly predicts both in-distribution (ID) and out-of-distribution (OOD) generalization on GSM8K with Gemma-3 models. Across the 1B and 4B variants, intrinsic dimensionality outperforms trajectory length and KL-based metrics as a predictor of generalization, with Executed PoT emerging as a particularly low-dimensional and highly generalizable strategy. The findings suggest that effective reasoning compresses the task into a lower-dimensional representation, offering a principled explanation for generalization gains inspired by minimum description length (MDL) and guiding the future design of reasoning strategies and data annotation.

Abstract

Chain-of-thought (CoT) reasoning and its variants have substantially improved the performance of language models on complex reasoning tasks, yet the precise mechanisms by which different strategies facilitate generalization remain poorly understood. While current explanations often point to increased test-time computation or structural guidance, establishing a consistent, quantifiable link between these factors and generalization remains challenging. In this work, we identify intrinsic dimensionality as a quantitative measure for characterizing the effectiveness of reasoning chains. Intrinsic dimensionality quantifies the minimum number of model dimensions needed to reach a given accuracy threshold on a given task. By keeping the model architecture fixed and varying the task formulation through different reasoning strategies, we demonstrate that effective reasoning strategies consistently reduce the intrinsic dimensionality of the task. Validating this on GSM8K with Gemma-3 1B and 4B, we observe a strong inverse correlation between the intrinsic dimensionality of a reasoning strategy and its generalization performance on both in-distribution and out-of-distribution data. Our findings suggest that effective reasoning chains facilitate learning by better compressing the task using fewer parameters, offering a new quantitative metric for analyzing reasoning processes.

Paper Structure (38 sections, 3 equations, 3 figures, 5 tables)

This paper contains 38 sections, 3 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Overview. Middle (Green): We calculate the intrinsic dimension of a reasoning strategy as described in \ref{ssec:method}, and then compare how well it predicts the generalization performance of models trained with different reasoning strategies (top; cf. \ref{ssec:expt}). On the right, we demonstrate a strong inverse correlation between intrinsic dimensionality and generalization performance (\ref{sec:results}).
  • Figure 2: Visualization of intrinsic dimension computation for Gemma-3 4B showing select reasoning strategies. We plot the Pareto frontier of monotonic training accuracy versus trainable parameters (log scale). The dashed line indicates the threshold ($\tau = 63.0\%$); intrinsic dimension is the parameter count where each curve first crosses this threshold (vertical dotted lines). Strategies crossing earlier have lower intrinsic dimensionality and tend to yield higher overall performance (cf. \ref{tab:4b_results}).
  • Figure 3: Visualization of intrinsic dimension computation for Gemma-3 1B showing select reasoning strategies. We plot the Pareto frontier of monotonic training accuracy versus trainable parameters (log scale). The dashed line indicates the threshold ($\tau = 24.3\%$); intrinsic dimension is the parameter count where each curve first crosses this threshold (vertical dotted lines).
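
The threshold-crossing procedure described in the captions for Figures 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `monotonic_frontier` and `intrinsic_dimension` are hypothetical, the (parameter count, accuracy) pairs are made-up placeholder values rather than results from the paper, and the sketch assumes the crossing point is read off at the measured parameter counts without interpolation.

```python
from typing import Iterable, List, Optional, Tuple

Run = Tuple[int, float]  # (trainable parameter count, training accuracy in %)


def monotonic_frontier(runs: Iterable[Run]) -> List[Run]:
    """Keep only runs that improve on the best accuracy seen at smaller sizes.

    Sorting by parameter count (ties broken by higher accuracy first) and
    tracking a running maximum yields a non-decreasing accuracy curve.
    """
    frontier: List[Run] = []
    best_acc = float("-inf")
    for params, acc in sorted(runs, key=lambda r: (r[0], -r[1])):
        if acc > best_acc:
            best_acc = acc
            frontier.append((params, acc))
    return frontier


def intrinsic_dimension(runs: Iterable[Run], tau: float) -> Optional[int]:
    """Smallest measured parameter count whose frontier accuracy reaches tau.

    Returns None if no run reaches the threshold.
    """
    for params, acc in monotonic_frontier(runs):
        if acc >= tau:
            return params
    return None


# Hypothetical sweeps for two unnamed strategies (illustrative numbers only).
strategy_a = [(1_000, 10.0), (10_000, 40.0), (100_000, 65.0),
              (1_000_000, 64.0), (10_000_000, 70.0)]
strategy_b = [(1_000, 5.0), (10_000, 20.0), (100_000, 50.0),
              (1_000_000, 62.0), (10_000_000, 66.0)]

tau = 63.0  # threshold value taken from the Figure 2 caption
print(intrinsic_dimension(strategy_a, tau))
print(intrinsic_dimension(strategy_b, tau))
```

In this toy sweep, the first strategy crosses the threshold at a smaller parameter count than the second, which is the sense in which a strategy "crossing earlier" has lower intrinsic dimensionality.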