Does Reasoning Emerge? Examining the Probabilities of Causation in Large Language Models
Javier González, Aditya V. Nori
TL;DR
This work asks whether large language models truly reason, framing reasoning in terms of two probabilities of causation: the probability of necessity (PN) and the probability of sufficiency (PS). It introduces a Hex-based abstract-machine framework in which problems are solved through a three-step pipeline of prompt, latent state, and output, and links these processes to causal models with counterfactual data generation. Comparing true PN/PS values with LLM-derived estimates across several math problems, the paper finds that reasoning-like behavior emerges only partially, most clearly for Div6 with GPT-4, and highlights limitations related to counterfactual consistency and data quality. The findings suggest that larger LLMs align better with causal reasoning, but that reliable, general-purpose reasoning remains an open challenge with important implications for safety and deployment.
Abstract
Recent advances in AI have been significantly driven by the capabilities of large language models (LLMs) to solve complex problems in ways that resemble human thinking. However, there is an ongoing debate about the extent to which LLMs are capable of actual reasoning. Central to this debate are two key probabilistic concepts that are essential for connecting causes to their effects: the probability of necessity (PN) and the probability of sufficiency (PS). This paper introduces a framework that is both theoretical and practical, aimed at assessing how effectively LLMs are able to replicate real-world reasoning mechanisms using these probabilistic measures. By viewing LLMs as abstract machines that process information through a natural language interface, we examine the conditions under which it is possible to compute suitable approximations of PN and PS. Our research marks an important step towards gaining a deeper understanding of when LLMs are capable of reasoning, as illustrated by a series of math examples.
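To make the two measures concrete, the sketch below computes them exactly for a toy divisibility problem. It uses the standard counterfactual definitions, PN = P(Y_{X=0} = 0 | X = 1, Y = 1) and PS = P(Y_{X=1} = 1 | X = 0, Y = 0). Everything else is an illustrative assumption and is not taken from the paper: the function name `pn_ps`, the choice of X as "divisible by 3" and Y as "divisible by 6", the structural rule Y = X and Z with Z as "divisible by 2", and the uniform population of integers 1 to 60.

```python
from fractions import Fraction


def pn_ps(numbers):
    """Exact PN and PS of X for Y under an assumed toy model.

    Assumed (hypothetical) structural model, not the paper's code:
      X = "n is divisible by 3", Z = "n is divisible by 2", Y = X and Z.
    Counterfactuals set X by intervention while holding Z fixed.
    """
    def y(x, z):
        return x and z

    units = [(n % 3 == 0, n % 2 == 0) for n in numbers]

    # PN conditions on units where X = 1 and Y = 1,
    # then asks whether Y would be 0 had X been 0.
    pn_units = [(x, z) for x, z in units if x and y(x, z)]
    # PS conditions on units where X = 0 and Y = 0,
    # then asks whether Y would be 1 had X been 1.
    ps_units = [(x, z) for x, z in units if not x and not y(x, z)]

    # Both conditioning sets must be non-empty for the ratios to exist.
    pn = Fraction(sum(not y(False, z) for _, z in pn_units), len(pn_units))
    ps = Fraction(sum(y(True, z) for _, z in ps_units), len(ps_units))
    return pn, ps


pn, ps = pn_ps(range(1, 61))
print(pn, ps)  # 1 1/2
```

Under these assumptions, divisibility by 3 is fully necessary for divisibility by 6 (PN = 1) but only partly sufficient (PS = 1/2), because half of the integers not divisible by 3 are even. Values of this kind play the role of the true PN/PS against which LLM-derived estimates would be compared.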
