VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search

David Brandfonbrener; Simon Henniger; Sibi Raja; Tarun Prasad; Chloe Loughridge; Federico Cassano; Sabrina Ruixin Hu; Jianang Yang; William E. Byrd; Robert Zinkov; Nada Amin

VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search

David Brandfonbrener, Simon Henniger, Sibi Raja, Tarun Prasad, Chloe Loughridge, Federico Cassano, Sabrina Ruixin Hu, Jianang Yang, William E. Byrd, Robert Zinkov, Nada Amin

TL;DR

VerMCTS advances verified program synthesis by tightly integrating a formal verifier with a large language model within a modified Monte Carlo Tree Search. The method defines an MDP over code-prefix states, uses progressive widening and a verifier-informed scoring function to guide expansion, and backpropagation to converge on complete, verified programs in Dafny and Coq. A new 15-program problem suite demonstrates that VerMCTS substantially improves pass@T compared to baselines, with notable gains in harder Coq tasks and in exploring deeper search trees. The work demonstrates that cheap, sound verifier feedback can dramatically enhance multi-step program synthesis, with implications for safer code generation and broader applicability to formal reasoning tasks.

Abstract

Large Language Models (LLMs) can generate useful code, but often the code they generate cannot be trusted to be sound. In this paper, we present VerMCTS, an approach to begin to resolve this issue by generating verified programs in Dafny and Coq. VerMCTS uses a logical verifier in concert with an LLM to guide a modified Monte Carlo Tree Search (MCTS). This approach leverages the verifier to gain intermediate feedback inside the search algorithm by checking partial programs at each step to estimate an upper bound on the value function. To measure the performance of VerMCTS, we develop a new suite of multi-step verified programming problems in Dafny and Coq. In terms of pass@T, a new metric which computes the pass rate given a budget of T tokens sampled from the LLM, VerMCTS leads to more than a 30% absolute increase in average pass@5000 across the suite over repeated sampling from the base language model. Our code and benchmarks are available at https://github.com/namin/llm-verified-with-monte-carlo-tree-search .

VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search

TL;DR

Abstract

Paper Structure (68 sections, 1 theorem, 3 equations, 8 figures, 1 algorithm)

This paper contains 68 sections, 1 theorem, 3 equations, 8 figures, 1 algorithm.

Introduction
Method: VerMCTS
MDP for verified program synthesis
VerMCTS
Progressive widening.
Selection: priors and UCT.
Combining expansion and evaluation.
Backpropagation.
Connecting the partial program score to the MDP
A problem suite for multi-step verified programming
Defining the problems
Criteria for Success
Experimental setup
Pass@$T$ evaluation metric
Base model
...and 53 more sections

Key Result

Lemma 2.1

The value $v(s)$ returned by Algorithm alg:eval_and_expand satisfies the following:

Figures (8)

Figure 1: Overview of VerMCTS. The search tree is visualized with "widen" nodes marked with $w$. (a) A leaf node is selected as in standard MCTS. (b) The selected node is evaluated and maybe expanded. If the selected node is a widen node, then it's parent is selected and maybe expanded (i.e. made wider). See Figure \ref{['fig:expand']} for a detailed description. (c) Once we have a value and maybe from the evaluate and maybe expand algorithm, we backpropagate that value up the tree. This figure illustrates the special case where we observed a failure, so no node is added and the score is -1.
Figure 2: Evaluate and maybe expand. Given a prefix, we query the LLM for expansions until the verifier is able to return a score. If that score is a failure, we do not add the node to the tree, but update the parent with a value of -1. If the score is pass, then we check if the program is complete. If incomplete, we add the expansion to the tree with a score of 1. If complete, we have found a successful program to return.
Figure 3: Average results for pass@T vs. T (the number of tokens) for various baseline methods on our suite of programming problems in Dafny and Coq.
Figure 4: Pass@T results for all algorithms on our suite of problems in Dafny.
Figure 5: Pass@T results for all algorithms on our suite of problems in Coq.
...and 3 more figures

Theorems & Definitions (1)

Lemma 2.1

VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search

TL;DR

Abstract

VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (8)

Theorems & Definitions (1)