Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties

Gouki Minegishi, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

TL;DR

The paper introduces reasoning graphs extracted from the hidden states of large reasoning models and analyzes three graph-theoretic properties (cyclicity, graph diameter, and small-world index) to understand reasoning mechanisms. By clustering segment representations and tracing sequential node visits, the authors compare base and large reasoning models across GSM8K, MATH500, and AIME 2024, finding that large reasoning models exhibit about $5$ cycles per sample, much larger diameters, and pronounced small-world characteristics (roughly $\times 6$) that correlate with accuracy. The study further shows that supervised fine-tuning on improved datasets expands reasoning graph diameters and enhances performance, offering concrete data-construction guidelines to boost reasoning. Collectively, these results link internal graph-structural properties to empirical reasoning gains, informing interpretability and training-data design for advanced LLMs.

Abstract

Recent large-scale reasoning models have achieved state-of-the-art performance on challenging mathematical benchmarks, yet the internal mechanisms underlying their success remain poorly understood. In this work, we introduce the notion of a reasoning graph, extracted by clustering hidden-state representations at each reasoning step, and systematically analyze three key graph-theoretic properties: cyclicity, diameter, and small-world index, across multiple tasks (GSM8K, MATH500, AIME 2024). Our findings reveal that distilled reasoning models (e.g., DeepSeek-R1-Distill-Qwen-32B) exhibit significantly more recurrent cycles (about 5 per sample), substantially larger graph diameters, and pronounced small-world characteristics (about 6x) compared to their base counterparts. Notably, these structural advantages grow with task difficulty and model capacity, with cycle detection peaking at the 14B scale and exploration diameter maximized in the 32B variant, correlating positively with accuracy. Furthermore, we show that supervised fine-tuning on an improved dataset systematically expands reasoning graph diameters in tandem with performance gains, offering concrete guidelines for dataset design aimed at boosting reasoning capabilities. By bridging theoretical insights into reasoning graph structures with practical recommendations for data construction, our work advances both the interpretability and the efficacy of large reasoning models.
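To make the graph properties named in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes the clustering step has already been done, so each reasoning step is represented by a hypothetical integer cluster id. It also assumes two simplifications that the abstract does not specify: a cycle is counted whenever the trace returns to an already-visited node (ignoring immediate repeats), and the diameter is measured in unweighted hops on the undirected graph. The function names and the example trace are illustrative only, and the small-world index is omitted.

```python
from collections import deque


def build_reasoning_graph(node_sequence):
    """Build an undirected adjacency map from a sequence of visited cluster ids."""
    adjacency = {node: set() for node in node_sequence}
    for src, dst in zip(node_sequence, node_sequence[1:]):
        if src != dst:  # skip self-loops from consecutive identical clusters
            adjacency[src].add(dst)
            adjacency[dst].add(src)
    return adjacency


def count_revisits(node_sequence):
    """Count returns to an already-visited node, ignoring immediate repeats."""
    seen = set()
    revisits = 0
    previous = None
    for node in node_sequence:
        if node in seen and node != previous:
            revisits += 1
        seen.add(node)
        previous = node
    return revisits


def hop_diameter(adjacency):
    """Longest shortest-path length (in hops) over all node pairs, via BFS."""
    diameter = 0
    for start in adjacency:
        distance = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbor in adjacency[current]:
                if neighbor not in distance:
                    distance[neighbor] = distance[current] + 1
                    queue.append(neighbor)
        diameter = max(diameter, max(distance.values()))
    return diameter


# Hypothetical cluster ids for one reasoning trace (assumed, not from the paper)
trace = [0, 1, 2, 1, 3, 4, 1]
graph = build_reasoning_graph(trace)
cycles = count_revisits(trace)
diameter = hop_diameter(graph)
```

In this toy trace the path returns to node 1 twice, so a model that revisits earlier states more often would score higher on this cycle count, while a trace that wanders through more distinct, loosely connected states would yield a larger hop diameter.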

Paper Structure

This paper contains 35 sections, 3 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Illustration of the concept of reasoning graphs, comparing base models and large reasoning models. Nodes represent simple computational states (e.g., calculation steps shown on the left), with paths leading to the final answer constituting the reasoning graph. We analyze graph-theoretic properties of reasoning graphs, including cyclic structures, diameter, and small-world characteristics. Examining these structural distinctions enables us to better understand recent performance improvements in challenging mathematical tasks.
  • Figure 2: (a) Illustration of the methodology used to extract reasoning graphs from LLMs. (b) Representative nodes obtained by clustering the hidden states of DeepSeek-R1-Distill-Qwen-32B on the GSM8K dataset.
  • Figure 3: Visualization of reasoning graphs on the GSM8K dataset using t-SNE embeddings. The upper row shows graphs from the base model (Qwen2.5-32B), while the lower row shows those from the large reasoning model (DeepSeek-R1-Distill-Qwen-32B). Compared to the base model, the reasoning model exhibits qualitatively broader exploration with notably more cycles in its reasoning graphs.
  • Figure 4: Comparison of cycle detection ratios across different layers in the large reasoning model (DeepSeek-R1-Distill-Qwen-32B) and the base model (Qwen2.5-32B), evaluated on three tasks: (a) GSM8K, (b) MATH500, and (c) AIME 2024. Results consistently show that the large reasoning model exhibits significantly higher cycle detection ratios than the base model at all layer ratios and tasks. Additionally, a trend emerges, indicating that the cycle detection ratio increases as task difficulty escalates from GSM8K through MATH500 to AIME 2024.
  • Figure 5: (a) Distribution of cycle counts for the large reasoning model (DeepSeek-R1-Distill-Qwen-32B) and the base model (Qwen2.5-32B) across various hidden layer depths. The reasoning model exhibits significantly higher cycle counts. (b) Distribution of reasoning graph diameters across various hidden layer depths. The diameter of reasoning graphs increases progressively with deeper layers. The reasoning model demonstrates significantly larger graph diameters, indicating a broader exploration space compared to the base model.
  • ...and 11 more figures