On the Expressive Power of GNNs for Boolean Satisfiability
Saku Peltonen, Roger Wattenhofer
TL;DR
This work establishes fundamental limits on the expressivity of GNNs for SAT solving by analyzing them through the Weisfeiler-Leman hierarchy, proving that the full WL framework cannot universally distinguish satisfiable from unsatisfiable formulas. It identifies both hard and easy subfamilies: 3-regular SAT and k-WL indistinguishable instances resist WL-based discrimination, while PlanarSAT is fully identified by 4-WL and random-SAT instances are largely WL-identifiable. The authors corroborate these findings experimentally on G4SAT and SAT Competition data, showing random instances are often distinguishable with few WL iterations, whereas industrial instances may require greater expressivity or remain unsolved by WL-bound architectures. The results bridge theory and practice, linking WL indistinguishability to sequential solving limitations and highlighting avenues for more expressive GNN designs or alternative strategies. Overall, the paper clarifies when WL-powered GNNs can be effective for SAT and where they fundamentally fail, guiding future architectural and dataset considerations for learning-based SAT solvers.
Abstract
Machine learning approaches to solving Boolean Satisfiability (SAT) aim to replace handcrafted heuristics with learning-based models. Graph Neural Networks have emerged as the main architecture for SAT solving, due to the natural graph representation of Boolean formulas. We analyze the expressive power of GNNs for SAT solving through the lens of the Weisfeiler-Leman (WL) test. As our main result, we prove that the full WL hierarchy cannot, in general, distinguish between satisfiable and unsatisfiable instances. We show that indistinguishability under higher-order WL carries over to practical limitations for WL-bounded solvers that set variables sequentially. We further study the expressivity required for several important families of SAT instances, including regular, random and planar instances. To quantify expressivity needs in practice, we conduct experiments on random instances from the G4SAT benchmark and industrial instances from the International SAT Competition. Our results suggest that while random instances are largely distinguishable, industrial instances often require more expressivity to predict a satisfying assignment.
