Have Large Language Models Learned to Reason? A Characterization via 3-SAT Phase Transition
Rishi Hazra, Gabriele Venturato, Pedro Zuidberg Dos Martires, Luc De Raedt
TL;DR
Problem: Do LLMs learn to reason, or do they rely on statistical cues in reasoning tasks?
Approach: A principled evaluation framework based on random 3-SAT phase transitions, using SAT-Menu and SAT-CNF prompts, tested across multiple LLMs including DeepSeek R1, with ground-truth validation via MiniSAT.
Findings: Accuracy drops on hard instances near the phase transition; DeepSeek R1 shows indications of learned, structured search, while other models rely more on heuristics; coupling LLMs with external solvers can push accuracy to near-perfect levels.
Significance: Provides a rigorous, theory-informed benchmark for reasoning in LLMs and points to neurosymbolic strategies for reliable, scalable reasoning.
Abstract
Large Language Models (LLMs) have been touted as AI models possessing advanced reasoning abilities. In theory, autoregressive LLMs with Chain-of-Thought (CoT) can perform more serial computations to solve complex reasoning tasks. However, recent studies suggest that, despite this capacity, LLMs do not truly learn to reason but instead fit to statistical features. To study their reasoning capabilities in a principled fashion, we adopt a computational theory perspective and propose an experimental protocol centered on 3-SAT, the prototypical NP-complete problem lying at the core of logical reasoning and constraint satisfaction tasks. Specifically, we examine the phase transitions in random 3-SAT and characterize the reasoning abilities of state-of-the-art LLMs by varying the inherent hardness of the problem instances. By comparing DeepSeek R1 with other LLMs, our findings reveal two key insights: (1) LLM accuracy drops significantly on harder instances, suggesting that all current models struggle when statistical shortcuts are unavailable; and (2) unlike other LLMs, R1 shows signs of having learned the underlying reasoning. By following a principled experimental protocol, our study moves beyond the benchmark-driven evidence often found in LLM reasoning research. Our findings highlight important gaps and suggest clear directions for future research.
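
To make the notion of a phase transition in random 3-SAT concrete, the sketch below samples random formulas at several clause-to-variable ratios (here called alpha) and estimates the fraction that are satisfiable. It is only an illustration under stated assumptions, not the paper's protocol: the function names, the sampling scheme (three distinct variables per clause, each negated with probability 1/2), the instance sizes, and the brute-force satisfiability check (used in place of a real solver such as MiniSAT) are all choices made for this example. For large numbers of variables the satisfiable fraction is known empirically to fall sharply around alpha of roughly 4.27; at the small sizes used here the drop is much more gradual.

```python
import itertools
import random


def random_3sat(n_vars, n_clauses, rng):
    """Sample a random 3-SAT formula as a list of 3-literal clauses.

    Literals are nonzero ints: +v means variable v, -v means its negation.
    Assumed sampling scheme: 3 distinct variables per clause, random signs.
    """
    formula = []
    for _ in range(n_clauses):
        variables = rng.sample(range(1, n_vars + 1), 3)
        clause = tuple(v if rng.random() < 0.5 else -v for v in variables)
        formula.append(clause)
    return formula


def is_satisfiable(formula, n_vars):
    """Brute-force check over all 2**n_vars assignments (small n_vars only)."""
    for bits in itertools.product((False, True), repeat=n_vars):
        # A literal is true when the variable's value matches its polarity.
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in formula):
            return True
    return False


def sat_fraction(n_vars, alpha, trials, seed=0):
    """Estimate P(satisfiable) for random formulas with round(alpha * n_vars) clauses."""
    rng = random.Random(seed)
    n_clauses = round(alpha * n_vars)
    hits = sum(
        is_satisfiable(random_3sat(n_vars, n_clauses, rng), n_vars)
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    # Sweep the clause-to-variable ratio; values are illustrative.
    for alpha in (1.0, 2.0, 3.0, 4.0, 4.27, 5.0, 6.0, 8.0):
        print(f"alpha={alpha:.2f}  estimated P(sat)={sat_fraction(10, alpha, 50):.2f}")
```

Running the script prints one estimate per ratio. Instances with a ratio near the transition are the hard region the abstract refers to, while very low or very high ratios tend to be easy (almost always satisfiable or almost always unsatisfiable, respectively).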
