Does Reasoning Emerge? Examining the Probabilities of Causation in Large Language Models

Javier González, Aditya V. Nori

TL;DR

This work addresses whether large language models truly reason by framing reasoning as probabilistic causation using PN and PS. It introduces a Hex-based abstract-machine framework in which problems are solved via a three-step prompt–latent-state–output pipeline, and links these processes to causal models with counterfactual data generation. By comparing true PN/PS values with LLM-derived estimates across several math problems, the paper demonstrates that reasoning-like behavior emerges only partially, most clearly for Div6 with GPT-4, and highlights limitations related to counterfactual consistency and data quality. The findings suggest that while larger LLMs show improved alignment with causal reasoning, reliable, general-purpose reasoning remains an open challenge with important implications for safety and deployment.

Abstract

Recent advances in AI have been significantly driven by the capabilities of large language models (LLMs) to solve complex problems in ways that resemble human thinking. However, there is an ongoing debate about the extent to which LLMs are capable of actual reasoning. Central to this debate are two key probabilistic concepts that are essential for connecting causes to their effects: the probability of necessity (PN) and the probability of sufficiency (PS). This paper introduces a framework that is both theoretical and practical, aimed at assessing how effectively LLMs are able to replicate real-world reasoning mechanisms using these probabilistic measures. By viewing LLMs as abstract machines that process information through a natural language interface, we examine the conditions under which it is possible to compute suitable approximations of PN and PS. Our research marks an important step towards gaining a deeper understanding of when LLMs are capable of reasoning, as illustrated by a series of math examples.
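To make the two quantities concrete, here is a minimal sketch of how reference values of PN and PS could be computed for a divisibility-by-6 problem. It assumes a toy structural model in which "divisible by 6" holds exactly when "divisible by 2" and "divisible by 3" both hold, treats "divisible by 3" as the cause, and uses the integers 1 to 50. The function names `div6_scm` and `pn_ps`, the choice of cause, and the number range are illustrative assumptions, not the authors' implementation.

```python
from fractions import Fraction


def div6_scm(n, do_c3=None):
    """Toy structural model (assumed): C6 = C2 AND C3.

    If do_c3 is given, C3 is set by intervention instead of being
    computed from n. Returns the pair (C3, C6).
    """
    c2 = n % 2 == 0
    c3 = (n % 3 == 0) if do_c3 is None else do_c3
    return c3, (c2 and c3)


def pn_ps(numbers):
    """Estimate PN and PS with X = 'divisible by 3', Y = 'divisible by 6'.

    PN = P(Y would be false had X been false | X true,  Y true)
    PS = P(Y would be true  had X been true  | X false, Y false)
    """
    pn_hits = pn_total = ps_hits = ps_total = 0
    for n in numbers:
        x, y = div6_scm(n)
        if x and y:
            # Counterfactual: switch the cause off and re-evaluate the effect.
            _, y_cf = div6_scm(n, do_c3=False)
            pn_total += 1
            pn_hits += int(not y_cf)
        elif (not x) and (not y):
            # Counterfactual: switch the cause on and re-evaluate the effect.
            _, y_cf = div6_scm(n, do_c3=True)
            ps_total += 1
            ps_hits += int(y_cf)
    return Fraction(pn_hits, pn_total), Fraction(ps_hits, ps_total)


pn, ps = pn_ps(range(1, 51))
print("PN =", pn, "PS =", ps)
```

Under these assumptions, PN is 1 because removing divisibility by 3 always removes divisibility by 6, while PS is 1/2 because forcing divisibility by 3 yields divisibility by 6 only for the even numbers. An LLM-based estimate would replace the counterfactual calls to `div6_scm` with the model's answers to counterfactual prompts.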

Paper Structure

This paper contains 26 sections, 1 theorem, 17 equations, and 12 figures.

Key Result

Lemma 1

Let $\mathcal{M}_{\mathcal{V}}$, with variables $\mathcal{V} = \{X, Y, Z\}$, be a structural causal model for a problem $(Q, \sigma_0)$, and let $M$ be an LLM that generates counterfactuals for $Y$. Then $M$ is $\beta$-counterfactual consistent with $\mathcal{M}_{\mathcal{V}}$ if and only if its ass

Figures (12)

  • Figure 1: Illustration of the actual vs. perceived reasoning abilities of GPT-2, GPT-3.5-turbo and GPT-4 for a simple arithmetic problem. We posed two distinct types of questions (direct and counterfactual) to the models, each repeated 10 times, for every {number} from 1 to 50. All three models showed an inflated sense of reasoning capability when answering the direct questions. The discrepancy is especially pronounced in GPT-3.5-turbo, which performed nearly flawlessly on direct questions but whose error rate surged above 25% on counterfactual questions.
  • Figure 2: Reasoning test for assessing an LLM's reasoning abilities. A) Divisibility rule and the corresponding reasoning graph. B) Dataset generation for computing PN and PS. C) Analysis comparing actual values of PN and PS with PN and PS estimates for the LLM-generated data.
  • Figure 3: The Hex diagram depicts two approaches for solving the problem $(Q, \sigma_0)$ outlined in Example \ref{prob:modulo}. The dotted path corresponds to the actual process of solving the problem, while the solid path represents the one taken by the LLM.
  • Figure 4: Left: Contingency tables for $\mathcal{D}_F$, $\mathcal{D}_{CF}$ and $\mathcal{D}_{CF}^{\mathrm{GPT-4}}$ in Example \ref{prob:modulo}. Right: Reasoning graphs for the other math problems in this paper. C-type nodes in the graph represent boolean conditions. See Appendix \ref{ap:equations} for details.
  • Figure 5: Left: Heatmaps comparing the consistency of data generated by GPT-2, GPT-3.5-turbo, and GPT-4 for the Div6 problem. Each heatmap cell represents the error rate of the corresponding model for each element of the problem across 10 replicated tests. Right: Sensitivity of the simulated PN relative to varying levels of random noise introduced in the true counterfactuals.
  • ...and 7 more figures

Theorems & Definitions (12)

  • Example 1
  • Definition 1: Counterfactual query
  • Definition 2: Counterfactual prompt
  • Definition 3: $\beta$-counterfactual consistency
  • Lemma 1
  • Definition 4: Causal Model
  • Definition 5: Intervention, $\mathit{do}$ operator
  • Definition 6: Potential outcome and counterfactual
  • Definition 7: Probability of necessity (Pearl, 1999)
  • Definition 8: Probability of sufficiency (Pearl, 1999)
  • ...and 2 more
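
For reference, the probabilities of necessity and sufficiency named in Definitions 7 and 8 are usually written as follows for a binary cause $X$ and effect $Y$, where $x, y$ denote the "true" values, $x', y'$ their complements, and $Y_{x}$ the potential outcome of $Y$ under the intervention $\mathit{do}(X = x)$. This is the standard formulation due to Pearl; the notation used in the paper's own definitions may differ.

$$
\mathrm{PN} = P\left(Y_{x'} = y' \mid X = x,\; Y = y\right), \qquad
\mathrm{PS} = P\left(Y_{x} = y \mid X = x',\; Y = y'\right)
$$

PN asks how likely the effect would have been absent had the cause been absent, given that both actually occurred. PS asks how likely the effect would have occurred had the cause been present, given that both were actually absent.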