Table of Contents
Fetching ...

Learning from Algorithm Feedback: One-Shot SAT Solver Guidance with GNNs

Jan Tönshoff, Martin Grohe

TL;DR

This work tackles speeding SAT solving by learning to guide branching through a one-shot GNN-based parameterization of per-variable weights $w(x) olinebreak[4]in olinebreak[4] \,\mathbb{R}_{>0}$ and polarities $p(x) olinebreak[4]in olinebreak[4] \{0,1\}$. It introduces Reinforcement Learning from Algorithm Feedback (RLAF), training the GNN with GRPO to minimize solver cost, using the solver as the environment and the cost as the reward. The proposed method integrates into existing solvers via a generic weighting mechanism, enabling a single forward pass to influence all branching decisions, and demonstrates substantial mean-runtime reductions across Glucose and March on diverse problem classes, outperforming supervised baselines that rely on handcrafted features. The results show promising generalization to larger and harder instances, suggesting data-driven, solver-agnostic heuristic design can improve combinatorial optimization in practice, albeit with a note on current scalability limits and the potential for extending the approach to related domains such as MIPs and CSPs.

Abstract

Boolean Satisfiability (SAT) solvers are foundational to computer science, yet their performance typically hinges on hand-crafted heuristics. This work introduces Reinforcement Learning from Algorithm Feedback (RLAF) as a paradigm for learning to guide SAT solver branching heuristics with Graph Neural Networks (GNNs). Central to our approach is a novel and generic mechanism for injecting inferred variable weights and polarities into the branching heuristics of existing SAT solvers. In a single forward pass, a GNN assigns these parameters to all variables. Casting this one-shot guidance as a reinforcement learning problem lets us train the GNN with off-the-shelf policy-gradient methods, such as GRPO, directly using the solver's computational cost as the sole reward signal. Extensive evaluations demonstrate that RLAF-trained policies significantly reduce the mean solve times of different base solvers across diverse SAT problem distributions, achieving more than a 2x speedup in some cases, while generalizing effectively to larger and harder problems after training. Notably, these policies consistently outperform expert-supervised approaches based on learning handcrafted weighting heuristics, offering a promising path towards data-driven heuristic design in combinatorial optimization.

Learning from Algorithm Feedback: One-Shot SAT Solver Guidance with GNNs

TL;DR

This work tackles speeding SAT solving by learning to guide branching through a one-shot GNN-based parameterization of per-variable weights and polarities . It introduces Reinforcement Learning from Algorithm Feedback (RLAF), training the GNN with GRPO to minimize solver cost, using the solver as the environment and the cost as the reward. The proposed method integrates into existing solvers via a generic weighting mechanism, enabling a single forward pass to influence all branching decisions, and demonstrates substantial mean-runtime reductions across Glucose and March on diverse problem classes, outperforming supervised baselines that rely on handcrafted features. The results show promising generalization to larger and harder instances, suggesting data-driven, solver-agnostic heuristic design can improve combinatorial optimization in practice, albeit with a note on current scalability limits and the potential for extending the approach to related domains such as MIPs and CSPs.

Abstract

Boolean Satisfiability (SAT) solvers are foundational to computer science, yet their performance typically hinges on hand-crafted heuristics. This work introduces Reinforcement Learning from Algorithm Feedback (RLAF) as a paradigm for learning to guide SAT solver branching heuristics with Graph Neural Networks (GNNs). Central to our approach is a novel and generic mechanism for injecting inferred variable weights and polarities into the branching heuristics of existing SAT solvers. In a single forward pass, a GNN assigns these parameters to all variables. Casting this one-shot guidance as a reinforcement learning problem lets us train the GNN with off-the-shelf policy-gradient methods, such as GRPO, directly using the solver's computational cost as the sole reward signal. Extensive evaluations demonstrate that RLAF-trained policies significantly reduce the mean solve times of different base solvers across diverse SAT problem distributions, achieving more than a 2x speedup in some cases, while generalizing effectively to larger and harder problems after training. Notably, these policies consistently outperform expert-supervised approaches based on learning handcrafted weighting heuristics, offering a promising path towards data-driven heuristic design in combinatorial optimization.

Paper Structure

This paper contains 32 sections, 20 equations, 7 figures, 5 tables, 4 algorithms.

Figures (7)

  • Figure 1: DPLL Solver
  • Figure 2: a) The input formula $\phi$ is modeled as a graph $G(\phi)$. b) The graph is processed by a trainable GNN and outputs a parameterization policy $\pi_\theta(\phi)$. c) The policy $\pi_\theta(\phi)$ consists of independent variable-wise weight (LogNormal) and polarity (Bernoulli) distributions. d) A variable parameterization ${\mathcal{W}}=(w,p)$ is sampled from $\pi_\theta(\phi)$, mapping each variable $x$ in $\phi$ to a weight $w(x) \in \mathbb{R}_{>0}$ and polarity $p(x) \in \{0,1\}$. e) A guided Sat solver incorporates the parameterization ${\mathcal{W}}$ to guide its branching heuristic.
  • Figure 3: Learning to accelerate a Sat solver with GRPO: a) For a given training formula $\phi$ sample multiple variable parameterizations i.i.d. from the current policy $\pi_\theta(\phi)$. b) Run the Sat solver on $\phi$ with each parameterization. c) Map the cost of each solver run (i.e. the number of decisions) to the normalized group-relative advantage $\hat{A}(\phi,{\mathcal{W}})$. d) Optimize the model weights $\theta$ to maximize ${\mathcal{L}}_\text{PPO}$ to shift the policy towards faster parameterizations.
  • Figure 4: Runtimes relative to the base solver Glucose for RLAF and supervised approaches based on Backbones and Unsat cores. Less is better.
  • Figure 5: Weight correlation between policies learned with different solvers. For each instance distribution, we randomly sample $5,000$ variables $x$ from the corresponding validation set and plot the expected variable weight $E[w(x)]$ for the policies learned with either base solver. The color further indicates the backbone or Unsat core membership of each variable.
  • ...and 2 more figures