KisMATH: Do LLMs Have Knowledge of Implicit Structures in Mathematical Reasoning?

Soumadeep Saha, Akshay Chaturvedi, Saptarshi Saha, Utpal Garain, Nicholas Asher

TL;DR

This work investigates how chain-of-thought (CoT) traces aid mathematical reasoning by proposing a causal-graph abstraction of CoT traces, called Causal CoT Graphs (CCGraphs), and releasing KisMATH, a dataset of 1671 problems paired with LLM solutions and CCGraphs. It introduces a scalable CCGraph construction algorithm that enables graph-aligned interventions, which test mediation and causal structure through attention suppression and path-probability analyses across 15 open-weight LLMs. The key findings are that reasoning nodes in CCGraphs mediate the final answer and that LLMs preferentially traverse CCGraph-aligned reasoning paths, with two distinct behavioral regimes observed across models. The results highlight the practical significance of uncovering latent, graph-like structures in LLM reasoning and point to directions for more principled evaluation of, and intervention on, CoT in mathematical domains.

Abstract

Chain-of-thought (CoT) traces have been shown to improve performance of large language models on a plethora of reasoning tasks, yet there is no consensus on the mechanism by which this boost is achieved. To shed more light on this, we introduce Causal CoT Graphs (CCGraphs), which are directed acyclic graphs automatically extracted from reasoning traces that model fine-grained causal dependencies in language-model outputs. A collection of 1671 mathematical reasoning problems from MATH500, GSM8K, and AIME, together with their associated CCGraphs, has been compiled into our dataset -- KisMATH. Our detailed empirical analysis with 15 open-weight LLMs shows that (i) reasoning nodes in the CCGraphs are causal contributors to the final answer, which we argue is constitutive of reasoning; and (ii) LLMs emphasize the reasoning paths captured by the CCGraphs, indicating that the models internally realize structures similar to our graphs. KisMATH enables controlled, graph-aligned interventions and opens avenues for further investigation into the role of CoT in LLM reasoning.
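As a rough illustration of the structure the abstract describes, a CCGraph can be thought of as a directed acyclic graph whose nodes are question, reasoning, and answer expressions. The sketch below is not the paper's extraction algorithm; the node names (`q1`, `r1`, `a`, and so on), the edges, and the helper `simple_paths` are all hypothetical and exist only to show what a question-to-answer path through such a graph looks like.

```python
from typing import Dict, List

# Adjacency-list representation: node -> list of child nodes.
Graph = Dict[str, List[str]]


def simple_paths(graph: Graph, source: str, target: str) -> List[List[str]]:
    """Enumerate every directed path from source to target.

    Assumes the graph is acyclic, so every directed path is a simple path
    and the recursion terminates.
    """
    if source == target:
        return [[target]]
    paths: List[List[str]] = []
    for child in graph.get(source, []):
        for tail in simple_paths(graph, child, target):
            paths.append([source] + tail)
    return paths


# Hypothetical toy graph (invented for illustration, not taken from KisMATH):
# two question nodes, three reasoning nodes, one answer node.
toy_graph: Graph = {
    "q1": ["r1", "r2"],
    "q2": ["r2"],
    "r1": ["r3"],
    "r2": ["r3"],
    "r3": ["a"],
}

for path in simple_paths(toy_graph, "q1", "a"):
    print(" -> ".join(path))
```

Each printed line is one candidate question-to-answer path in the toy graph. Enumerating all paths is exponential in the worst case, so this approach is only sensible for small graphs.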

Paper Structure

This paper contains 20 sections, 6 equations, 11 figures, 3 tables, 1 algorithm.

Figures (11)

  • Figure 1: Example of extracted causal graph and paths. (Left) An example of a (simplified) CoT causal graph (CCGraph) extracted from the GSM8K dataset. Reasoning nodes are highlighted in blue, edges are in gray. (Right) An R path (see Eq. \ref{eq:rpath}), i.e., a simple path from question to answer (solid line) and a random path (dashed line).
  • Figure 2: Examples of R paths from the MATH500 and AIME splits of the KisMATH dataset. Nodes on the R path ($\hat{q}_{\alpha} \leadsto \hat{r}_{(i_1)} \leadsto \ldots \leadsto \hat{r}_{(i_{\mu})} \leadsto \hat{a}$) are highlighted (see Eq. \ref{eq:rpath}).
  • Figure 3: Do reasoning path interventions affect the answer? We find that when attentions corresponding to tokens in an R path are suppressed, the entropy of the distribution of the answer ($H(P_A)$) increases significantly, i.e., uncertainty over the answer is significantly increased. The figure also reports results of the 2-sample KS test, showing high values of Kolmogorov distance ($D_{KS}$) and high statistical significance ($p < 10^{-300}$).
  • Figure 4: Are LLMs aware of implicit structures in reasoning? We compare the probability associated with reasoning paths (see Eq. \ref{eq:path-prob}) with the probability of a random path through the reasoning response (e.g., Figure \ref{fig:example_annotation}, right). The graphs show the rank of a reasoning path compared to random paths (see Eq. \ref{eq:rank}) for various models. A striking peak is observed in the 100th-percentile region, indicating that a large fraction of reasoning paths consist entirely of higher-probability transitions.
  • Figure 5: Analyzing the "bell" shape. We compare a higher-resolution R path rank distribution ($\text{rank}_{50}(\mathcal{R})$) for two models exhibiting behavior at the two ends of the spectrum of rank distributions (see Figure \ref{fig:path-prob-2}). The model demonstrating the "bell" shape (DeepSeek R1 32B) has lower $P(\mathcal{R})$ scores for some R paths, and the scores have higher variance. Results are reported with 100 samples from the AIME split.
  • ...and 6 more figures
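
The captions of Figures 3 and 4 rely on two quantities: the entropy of the answer distribution, $H(P_A)$, and the rank of a reasoning path relative to random paths. The paper's exact definitions (its path-probability and rank equations) are not reproduced here, so the following is only a minimal sketch under assumed definitions: Shannon entropy in nats, and rank as the percentage of random-path scores that fall strictly below the reasoning path's score. All numbers and the function names `entropy` and `percentile_rank` are invented for illustration.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution.

    Zero-probability outcomes contribute nothing to the sum.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def percentile_rank(score: float, baseline: Sequence[float]) -> float:
    """Percentage of baseline scores strictly below `score` (assumed definition)."""
    below = sum(1 for b in baseline if b < score)
    return 100.0 * below / len(baseline)


# Invented answer distributions: a confident one and a maximally uncertain one.
# If suppressing attention moved a model from the first to the second,
# H(P_A) would increase.
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
print(entropy(peaked) < entropy(flat))

# Invented log-probabilities: one reasoning path versus four random paths.
r_path_logprob = -2.0
random_logprobs = [-9.5, -7.1, -12.3, -6.4]
print(percentile_rank(r_path_logprob, random_logprobs))
```

Under these assumptions, a reasoning path that scores higher than every sampled random path lands at the 100th percentile, which is the region where Figure 4 reports a peak.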