Learning from Algorithm Feedback: One-Shot SAT Solver Guidance with GNNs
Jan Tönshoff, Martin Grohe
TL;DR
This work tackles speeding SAT solving by learning to guide branching through a one-shot GNN-based parameterization of per-variable weights $w(x) olinebreak[4]in olinebreak[4] \,\mathbb{R}_{>0}$ and polarities $p(x) olinebreak[4]in olinebreak[4] \{0,1\}$. It introduces Reinforcement Learning from Algorithm Feedback (RLAF), training the GNN with GRPO to minimize solver cost, using the solver as the environment and the cost as the reward. The proposed method integrates into existing solvers via a generic weighting mechanism, enabling a single forward pass to influence all branching decisions, and demonstrates substantial mean-runtime reductions across Glucose and March on diverse problem classes, outperforming supervised baselines that rely on handcrafted features. The results show promising generalization to larger and harder instances, suggesting data-driven, solver-agnostic heuristic design can improve combinatorial optimization in practice, albeit with a note on current scalability limits and the potential for extending the approach to related domains such as MIPs and CSPs.
Abstract
Boolean Satisfiability (SAT) solvers are foundational to computer science, yet their performance typically hinges on hand-crafted heuristics. This work introduces Reinforcement Learning from Algorithm Feedback (RLAF) as a paradigm for learning to guide SAT solver branching heuristics with Graph Neural Networks (GNNs). Central to our approach is a novel and generic mechanism for injecting inferred variable weights and polarities into the branching heuristics of existing SAT solvers. In a single forward pass, a GNN assigns these parameters to all variables. Casting this one-shot guidance as a reinforcement learning problem lets us train the GNN with off-the-shelf policy-gradient methods, such as GRPO, directly using the solver's computational cost as the sole reward signal. Extensive evaluations demonstrate that RLAF-trained policies significantly reduce the mean solve times of different base solvers across diverse SAT problem distributions, achieving more than a 2x speedup in some cases, while generalizing effectively to larger and harder problems after training. Notably, these policies consistently outperform expert-supervised approaches based on learning handcrafted weighting heuristics, offering a promising path towards data-driven heuristic design in combinatorial optimization.
