Stepwise: Neuro-Symbolic Proof Search for Automated Systems Verification

Baoding He; Zenan Li; Wei Sun; Yuan Yao; Taolue Chen; Xiaoxing Ma; Zhendong Su

Stepwise: Neuro-Symbolic Proof Search for Automated Systems Verification

Baoding He, Zenan Li, Wei Sun, Yuan Yao, Taolue Chen, Xiaoxing Ma, Zhendong Su

Abstract

Formal verification via interactive theorem proving is increasingly used to ensure the correctness of critical systems, yet constructing large proof scripts remains highly manual and limits scalability. Advances in large language models (LLMs), especially in mathematical reasoning, make their integration into software verification increasingly promising. This paper introduces a neuro-symbolic proof generation framework designed to automate proof search for systems-level verification projects. The framework performs a best-first tree search over proof states, repeatedly querying an LLM for the next candidate proof step. On the neural side, we fine-tune LLMs using datasets of proof state-step pairs; on the symbolic side, we incorporate a range of ITP tools to repair rejected steps, filter and rank proof states, and automatically discharge subgoals when search progress stalls. This synergy enables data-efficient LLM adaptation and semantics-informed pruning of the search space. We implement the framework on a new Isabelle REPL that exposes fine-grained proof states and automation tools, and evaluate it on the FVEL seL4 benchmark and additional Isabelle developments. On seL4, the system proves up to 77.6\% of the theorems, substantially surpassing previous LLM-based approaches and standalone Sledgehammer, while solving significantly more multi-step proofs. Results across further benchmarks demonstrate strong generalization, indicating a viable path toward scalable automated software verification.

Stepwise: Neuro-Symbolic Proof Search for Automated Systems Verification

Abstract

Paper Structure (23 sections, 2 equations, 7 figures, 3 tables, 2 algorithms)

This paper contains 23 sections, 2 equations, 7 figures, 3 tables, 2 algorithms.

Introduction
Problem
Methodology
Proof Step Generation & Revision
Proof Step Generation
Proof Step Revision
Proof State Filtering & Ranking
Proof State Filtering
Proof State Ranking
Hammer Integration
Evaluation
Isabelle REPL
Experimental Setup
Experimental Results
RQ1: Effectiveness Comparison
...and 8 more sections

Figures (7)

Figure 1: An example theorem and its human-written proof from the seL4 project, shown alongside LLM-generated attempts. The human-written Isabelle proof illustrates the procedural, tactic-driven reasoning style that is common in software verification. We prompted two state-of-the-art LLMs, Gemini 3 and GPT-5.1, to synthesize the same proof under controlled conditions, with web search disabled to avoid any leakage of the original script. Although both models correctly recognized the proof context and invoked some relevant tactics, neither succeeded in producing a valid proof even after roughly one minute of internal reasoning. Their attempts frequently misuse the domain-specific tactic, or hallucinate nonexistent lemmas.
Figure 2: Overview of our neuro-symbolic proof-search framework. Starting from the initial proof state, the system repeatedly generates candidate proof steps using a fine-tuned LLM, followed by rule-based symbolic revision to repair syntactic or semantic issues in the proposed steps. Each accepted step is executed by Isabelle/HOL to derive new proof states, which are then filtered by symbolic tools to prune states containing counterexamples or duplicates. The remaining states are ranked using cumulative LLM log-probabilities, and the highest-scoring ones are selected for further expansion in a best-first search. If the search cannot complete the proof, the framework finally invokes Sledgehammer to attempt to resolve the remaining goals.
Figure 3: Proof success rate across ground-truth proof lengths. Our approach consistently outperforms existing methods. In particular, the success rate decreases as proofs grow longer, but our approach still maintains non-negligible performance even on proofs exceeding 10 lines.
Figure 4: Proof success rates across the six seL4 session categories. Our approach yields stronger performance than all baselines in all sessions. Notably, it achieves high accuracy even on the challenging SysInitGroup category, which is entirely unseen during training.
Figure 5: Proof completion rate under varying proportions of expert-provided proof lines. Our method is increasingly effective as more human-written lines are supplied. Overall, the method reduces human effort by 71.1% on average.
...and 2 more figures

Stepwise: Neuro-Symbolic Proof Search for Automated Systems Verification

Abstract

Stepwise: Neuro-Symbolic Proof Search for Automated Systems Verification

Authors

Abstract

Table of Contents

Figures (7)