Can Large Language Models Reason? A Characterization via 3-SAT

Rishi Hazra; Gabriele Venturato; Pedro Zuidberg Dos Martires; Luc De Raedt

Can Large Language Models Reason? A Characterization via 3-SAT

Rishi Hazra, Gabriele Venturato, Pedro Zuidberg Dos Martires, Luc De Raedt

TL;DR

This work examines the phase transitions in random 3-SAT and characterize the reasoning abilities of LLMs by varying the inherent hardness of the problem instances, and shows that integrating external reasoners can considerably enhance LLM performance.

Abstract

Large Language Models (LLMs) have been touted as AI models possessing advanced reasoning abilities. However, recent works have shown that LLMs often bypass true reasoning using shortcuts, sparking skepticism. To study the reasoning capabilities in a principled fashion, we adopt a computational theory perspective and propose an experimental protocol centered on 3-SAT -- the prototypical NP-complete problem lying at the core of logical reasoning and constraint satisfaction tasks. Specifically, we examine the phase transitions in random 3-SAT and characterize the reasoning abilities of LLMs by varying the inherent hardness of the problem instances. Our experimental evidence shows that LLMs are incapable of performing true reasoning, as required for solving 3-SAT problems. Moreover, we observe significant performance variation based on the inherent hardness of the problems -- performing poorly on harder instances and vice versa. Importantly, we show that integrating external reasoners can considerably enhance LLM performance. By following a principled experimental protocol, our study draws concrete conclusions and moves beyond the anecdotal evidence often found in LLM reasoning research.

Can Large Language Models Reason? A Characterization via 3-SAT

TL;DR

Abstract

Paper Structure (16 sections, 3 equations, 12 figures, 2 tables)

This paper contains 16 sections, 3 equations, 12 figures, 2 tables.

Introduction
Preliminaries
Phase Transitions in Random 3-SAT
SAT Solvers
Related Work
Methodology
Using LLMs as 3-SAT Solvers
Dataset Generation
Results
Can LLMs solve 3-SAT problems?
Can LLM-Modulo frameworks boost performance?
Discussion
Conclusion
Dataset Statistics
LLMs as Solvers: Failure Cases
...and 1 more sections

Figures (12)

Figure 1: The 3-SAT problem, visualized using a variant of the SAT game sat_game. In SAT, one must return a truth assignment to Boolean variables that satisfy a Boolean formula in conjunctive normal form (CNF) if one exists, and return unSAT otherwise. A row in the visualization represents a clause, which is a disjunction (connected by a logical OR $\lor$) of literals, wherein a literal can be positive ($X_1$) or negative ($\neg X_1$). An assignment satisfies a clause if one of its literals is assigned the value true. Since clauses are conjunctively connected (by a logical AND $\land$), all clauses must be satisfied for the formula to be satisfied. If no satisfying assignment exists, the formula is unsatisfiable.
Figure 2: Random 3-SAT Phase Transitionswhere_the_hard_problems_are. Plotted in red is the probability of a randomly sampled 3-SAT formula being satisfied against the hardness $\alpha$ of the formula. We can observe a clear phase transition occurring at $\alpha_c\approx 4.267$ (marked by the green dotted line). We identify two easy regions, one on either side of $\alpha_c$. The gray area in the middle denotes the hard region. The boundaries of the hard region are defined where the probability of the formula being satisfied ceases to be deterministically one (left) or zero (right). The solid blue line shows the mean time taken by the MiniSAT solver minisat to solve a 3-SAT instance. Notably, there is a spike in the solver's runtime near the critical $\alpha_c$ value. This is attributed to the absence of useful heuristics in this region, forcing the solver to resort to essentially exhaustive searches.
Figure 3: Accuracy of GPT-4 against $\alpha$ and the satisfiability ratio. [Left] Accuracy of GPT-4 against $\alpha$ for both SAT Decision and SAT Search. Notably, GPT-4's accuracy for both variants exhibits a significant decline around the critical point ($\alpha_c \approx 4.267$) aligning with the hard region. The dip in the performance mimics the increased solving time of the MiniSAT solver in the hard region. The setting is SAT-CNF. [Right] Accuracy of GPT-4 against the satisfiability ratio. We only include satisfiable instances and analyse problems from the hard and easy regions separately. Note that in our dataset, the 3-SAT problems exhibited maximum satisfiability ratios of approximately $0.4$ for the hard region and $0.62$ for the easy regions. The plots were generated using a size 4 moving window on $\alpha$ values.
Figure 4: Comparing GPT-4 with SOTA LLMs. [Left] Phase transition characteristics for all LLMs. [Right] Comparison of unSAT accuracy (i.e. accuracy of correctly predicting unsatisfiable problems) on 3-SAT decision problem. GPT-4 outperforms all other models in detecting unsatisfiable problems. Both plots are on the search version of SAT-Menu setup with zero-shot prompting. The left plot was generated using a size 4 moving window on $\alpha$ values.
Figure 5: [Left] The figure compares LLM-Modulo frameworks (denoted as SAT-Translate) -- here, GPT-4 equipped with a solver -- with standalone GPT-4 performance with SAT-Menu and SAT-CNF inputs. SAT-Translate approach (plotted in green) outperforms the rest, showing the significance of augmenting LLMs with symbolic solvers. [Right] The figure shows how the number of generated tokens by GPT-4 varies with the hardness ($\alpha$) of the problem -- increasing in the hard region and vice versa.
...and 7 more figures

Can Large Language Models Reason? A Characterization via 3-SAT

TL;DR

Abstract

Can Large Language Models Reason? A Characterization via 3-SAT

Authors

TL;DR

Abstract

Table of Contents

Figures (12)