Table of Contents
Fetching ...

Have Large Language Models Learned to Reason? A Characterization via 3-SAT Phase Transition

Rishi Hazra, Gabriele Venturato, Pedro Zuidberg Dos Martires, Luc De Raedt

TL;DR

Problem: Do LLMs learn to reason or rely on statistical cues in reasoning tasks? Approach: A principled evaluation framework based on random 3-SAT phase transitions, with SAT-Menu and SAT-CNF prompts, tested across multiple LLMs including DeepSeek R1, and ground-truth validation via MiniSAT. Findings: Accuracy drops in hard instances near the phase transition; DeepSeek R1 shows indications of learned, structured search, while other models rely more on heuristics; coupling LLMs with external solvers can push performance toward near-perfect. Significance: Provides a rigorous, theory-informed benchmark for reasoning in LLMs and points to neurosymbolic strategies to achieve reliable, scalable reasoning.

Abstract

Large Language Models (LLMs) have been touted as AI models possessing advanced reasoning abilities. In theory, autoregressive LLMs with Chain-of-Thought (CoT) can perform more serial computations to solve complex reasoning tasks. However, recent studies suggest that, despite this capacity, LLMs do not truly learn to reason but instead fit on statistical features. To study the reasoning capabilities in a principled fashion, we adopt a computational theory perspective and propose an experimental protocol centered on 3-SAT -- the prototypical NP-complete problem lying at the core of logical reasoning and constraint satisfaction tasks. Specifically, we examine the phase transitions in random 3-SAT and characterize the reasoning abilities of state-of-the-art LLMs by varying the inherent hardness of the problem instances. By comparing DeepSeek R1 with other LLMs, our findings reveal two key insights (1) LLM accuracy drops significantly on harder instances, suggesting all current models struggle when statistical shortcuts are unavailable (2) Unlike other LLMs, R1 shows signs of having learned the underlying reasoning. Following a principled experimental protocol, our study moves beyond the benchmark-driven evidence often found in LLM reasoning research. Our findings highlight important gaps and suggest clear directions for future research.

Have Large Language Models Learned to Reason? A Characterization via 3-SAT Phase Transition

TL;DR

Problem: Do LLMs learn to reason or rely on statistical cues in reasoning tasks? Approach: A principled evaluation framework based on random 3-SAT phase transitions, with SAT-Menu and SAT-CNF prompts, tested across multiple LLMs including DeepSeek R1, and ground-truth validation via MiniSAT. Findings: Accuracy drops in hard instances near the phase transition; DeepSeek R1 shows indications of learned, structured search, while other models rely more on heuristics; coupling LLMs with external solvers can push performance toward near-perfect. Significance: Provides a rigorous, theory-informed benchmark for reasoning in LLMs and points to neurosymbolic strategies to achieve reliable, scalable reasoning.

Abstract

Large Language Models (LLMs) have been touted as AI models possessing advanced reasoning abilities. In theory, autoregressive LLMs with Chain-of-Thought (CoT) can perform more serial computations to solve complex reasoning tasks. However, recent studies suggest that, despite this capacity, LLMs do not truly learn to reason but instead fit on statistical features. To study the reasoning capabilities in a principled fashion, we adopt a computational theory perspective and propose an experimental protocol centered on 3-SAT -- the prototypical NP-complete problem lying at the core of logical reasoning and constraint satisfaction tasks. Specifically, we examine the phase transitions in random 3-SAT and characterize the reasoning abilities of state-of-the-art LLMs by varying the inherent hardness of the problem instances. By comparing DeepSeek R1 with other LLMs, our findings reveal two key insights (1) LLM accuracy drops significantly on harder instances, suggesting all current models struggle when statistical shortcuts are unavailable (2) Unlike other LLMs, R1 shows signs of having learned the underlying reasoning. Following a principled experimental protocol, our study moves beyond the benchmark-driven evidence often found in LLM reasoning research. Our findings highlight important gaps and suggest clear directions for future research.

Paper Structure

This paper contains 24 sections, 3 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: The 3-SAT problem, visualized using a variant of the SAT game sat_game. In SAT, the goal is to return a truth assignment to Boolean variables that satisfies a Boolean formula in conjunctive normal form (CNF), or return unSAT if none exists. In the visualization, each row represents a clause -- a disjunction (logical OR, $\lor$) of literals, where each literal is either positive ($X_1$) or negative ($\neg X_1$). A clause is satisfied if at least one of its literals is assigned true. Clauses are joined by logical AND ($\land$), so all must be satisfied for the formula to hold. If no such assignment exists, the formula is unsatisfiable.
  • Figure 2: [Left]: Random 3-SAT Phase Transitionswhere_the_hard_problems_are. Plotted in red is the probability of a randomly sampled 3-SAT formula being satisfied against the hardness $\alpha$ of the formula. We can observe a clear phase transition occurring at $\alpha_c\approx 4.267$ (marked by a green - -). We identify two Easy regions, one on either side of $\alpha_c$. The gray area in the middle denotes the Hard region. The boundaries of the hard region are defined where the probability of the formula being satisfied ceases to be deterministically 1 (left) or 0 (right). The solid blue line shows the mean time taken by the MiniSAT solver to solve 3-SAT instances. Notably, there is a spike in the solver's runtime near $\alpha_c$. This is due to the absence of useful heuristics in this region, forcing the solver to resort to exhaustive searches. [Right]: DPLL Search Trace from a SAT Solver for the given formula. Red boxes denote unsatisfiable assignments; the green box highlights a satisfying one. $0$, $1$ are truth assignments.
  • Figure 3: [Left] 3-SAT performance comparison for the search version of SAT-Menu. [Right] Accuracy vs. satisfiability ratio on the search version of SAT-Menu. We only include satisfiable instances and analyze hard (solid line) and easy regions (dashed line) separately.
  • Figure 4: Failure Cases: SAT-CNF traces for DeepSeek-R1 and GPT-4o. Although the input formula is satisfiable, both models incorrectly predict it as unsatisfiable. Colored boxes indicate model behaviors: cyan for heuristic variable selection, orange - - - for mistakes, green … for backtracking, yellow for self-reflection, and violet for self-correction. Left branch always represents a True assignment. $\perp$ marks unsatisfiability, and ? indicates an unexplored subtree. Integers are denoted as variables (10 $\rightarrow \mathit{X}_{10}$). Numbers show the order of steps. The input formula is in CNF, where each list of integers represents a clause (e.g., [-9, 10, -7] $\mapsto (\neg \mathit{X}_9 \vee \mathit{X}_{10} \vee \neg \mathit{X}_7)$), and the full formula is a conjunction ($\wedge$) of these clauses.
  • Figure 5: Failure Cases: SAT-Menu traces for DeepSeek-R1. Although the input formula is unsatisfiable, R1 incorrectly predicts it as satisfiable. Colored boxes indicate model behaviors: cyan for heuristic variable selection, orange - - - for mistakes, green … for backtracking, yellow for self-reflection, violet for self-correction, and magenta for local search. Left branch always represents an assignment to the orderable list and vice versa. $\perp$ marks unsatisfiability. Numbers show the order of steps.
  • ...and 9 more figures