Table of Contents
Fetching ...

Can Transformers Reason Logically? A Study in SAT Solving

Leyan Pan, Vijay Ganesh, Jacob Abernethy, Chris Esposo, Wenke Lee

TL;DR

The paper analyzes whether decoder-only Transformers can perform formal logical reasoning for SAT by constructing a CoT-based solver that simulates DPLL backtracking. It provides both a theoretical construction showing feasibility for $3$-SAT and a practical compiler, PARAT, to instantiate the weights and validate correctness on DIMACS 3-SAT instances. Empirically, training on CoT traces yields strong out-of-distribution generalization for the same problem length but fails to generalize to longer instance sizes, highlighting fundamental length-generalization limits. The work advances theoretical foundations for Transformer reasoning in formal domains and offers a practical toolchain for compiling and evaluating such constructions.

Abstract

We formally study the logical reasoning capabilities of decoder-only Transformers in the context of the boolean satisfiability (SAT) problem. First, we prove by construction that decoder-only Transformers can decide 3-SAT, in a non-uniform model of computation, using backtracking and deduction via Chain-of-Thought (CoT). %We prove its correctness by showing trace equivalence to the well-known DPLL SAT-solving algorithm. Second, we implement our construction as a PyTorch model with a tool (PARAT) that we designed to empirically demonstrate its correctness and investigate its properties. Third, rather than \textit{programming} a transformer to reason, we evaluate empirically whether it can be \textit{trained} to do so by learning directly from algorithmic traces (``reasoning paths'') from our theoretical construction. The trained models demonstrate strong out-of-distribution generalization on problem sizes seen during training but has limited length generalization, which is consistent with the implications of our theoretical result

Can Transformers Reason Logically? A Study in SAT Solving

TL;DR

The paper analyzes whether decoder-only Transformers can perform formal logical reasoning for SAT by constructing a CoT-based solver that simulates DPLL backtracking. It provides both a theoretical construction showing feasibility for -SAT and a practical compiler, PARAT, to instantiate the weights and validate correctness on DIMACS 3-SAT instances. Empirically, training on CoT traces yields strong out-of-distribution generalization for the same problem length but fails to generalize to longer instance sizes, highlighting fundamental length-generalization limits. The work advances theoretical foundations for Transformer reasoning in formal domains and offers a practical toolchain for compiling and evaluating such constructions.

Abstract

We formally study the logical reasoning capabilities of decoder-only Transformers in the context of the boolean satisfiability (SAT) problem. First, we prove by construction that decoder-only Transformers can decide 3-SAT, in a non-uniform model of computation, using backtracking and deduction via Chain-of-Thought (CoT). %We prove its correctness by showing trace equivalence to the well-known DPLL SAT-solving algorithm. Second, we implement our construction as a PyTorch model with a tool (PARAT) that we designed to empirically demonstrate its correctness and investigate its properties. Third, rather than \textit{programming} a transformer to reason, we evaluate empirically whether it can be \textit{trained} to do so by learning directly from algorithmic traces (``reasoning paths'') from our theoretical construction. The trained models demonstrate strong out-of-distribution generalization on problem sizes seen during training but has limited length generalization, which is consistent with the implications of our theoretical result

Paper Structure

This paper contains 56 sections, 19 theorems, 83 equations, 6 figures, 1 table, 1 algorithm.

Key Result

Theorem 1.1

For any $p, c \in {\mathbb{N}}^+$, there exists a decoder-only Transformer with $O(p^2)$ parameters that can decide all 3-SAT instances of at most $p$ variables and $c$ clauses using Chain-of-Thought.

Figures (6)

  • Figure 1: Visualization of the Chain-of-Thought (CoT) process used by our model to solve an example 3-SAT formula described in \ref{['thm:sat_search']}. The model autonomously performs trial-and-error reasoning, making multiple attempts and backtracking upon encountering conflicts. Here, $T$ represents True and $F$ represents False. Tokens in typewriter font denote the CoT generated by the model.
  • Figure 2: Illustration of the encoding scheme $E(C)$ and $E(A)$ for clauses and partial assignments from \ref{['def:clause_encode']} with $p=4$.
  • Figure 3: Result of the Length generalization experiments, showing SAT/UNSAT prediction accuracy (solid) and full trace accuracy (opaque, dashed) of Transformer models trained on the marginal, random, and skewed dataset with CoT on the marginal dataset over 4-20 variables. Left: model trained on 6-10 variables. Right: model trained on 11-15 variables. Compiled refers to the compiled model corresponding to our theoretical construction.
  • Figure 4: CoT Lengths generated by the compiled SAT-Solver Model vs the number of boolean variables in sampled SAT formulas, y-axis in log scale. Solid lines denote the maximum CoT length for each dataset while opaque, dashed lines denote the average CoT length. The empirical maximum CoT length in our datasets is bounded by $8p\cdot 2^{0.08p}$
  • Figure 5: The impact of soft attention in Transformer layers on the SAT/UNSAT prediction accuracy. $\beta$ is a scaling factor that allows the soft attention operation to better simulate hard attention at the cost of larger model parameter values in attention layers. The model achieves perfect accuracy on all "marginal" datasets starting at $\beta=17.5$, and for lower $\beta$ values, accuracy is negatively correlated with the number of variables in the datasets.
  • ...and 1 more figures

Theorems & Definitions (41)

  • Theorem 1.1: Informal version of \ref{['thm:sat_search']}
  • Definition 4.1: Decision Problem
  • Definition 4.2: Decision Procedure
  • Definition 4.3: Autoregressive Decision Procedure
  • Definition 4.4: $\operatorname{3-SAT}_{p,c}$
  • Theorem 4.5: Decoder-only Transformers can solve SAT
  • Definition 4.6: Encoding of clauses and partial assignments, extending sato2021matsat
  • Lemma 4.7
  • Lemma 4.8: Parallel Processing of Clauses, Informal
  • Lemma 3.1: Simulating a 2-Layer ReLU MLP with ReGLU Activation
  • ...and 31 more