Table of Contents
Fetching ...

Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers

Ran Xin, Zeyu Zheng, Yanchen Nie, Kun Yuan, Xia Xiao

TL;DR

BFS-Prover-V2 addresses the dual scaling challenge of LLM-based theorem provers by pairing a training-time, AlphaZero-inspired multi-turn off-policy RL pipeline with an inference-time planner-guided, multi-agent search architecture. The RL component employs adaptive tactic filtering and periodic retraining to overcome long-horizon learning plateaus, while inference uses a high-level Planner to decompose theorems into subgoals solved by parallel Prover agents sharing a subgoal cache. The system achieves state-of-the-art results on miniF2F (95.08% with 95.49% on validation) and ProofNet (41.4%), demonstrating strong generalization and scalability. The combined RL and planning approach offers a generalizable framework for long-horizon reasoning tasks beyond formal mathematics.

Abstract

The integration of Large Language Models (LLMs) into automated theorem proving has shown immense promise, yet is fundamentally constrained by challenges in scaling up both training-time reinforcement learning (RL) and inference-time compute. This paper introduces \texttt{BFS-Prover-V2}, a system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi-turn off-policy RL framework for continually improving the performance of LLM step-prover at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi-stage expert iteration pipeline featuring adaptive tactic-level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long-term RL in LLM-based agents. The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high-level planner to iteratively decompose complex theorems into a sequence of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state-of-the-art results on established formal mathematics benchmarks. \texttt{BFS-Prover-V2} achieves 95.08\% and 41.4\% on the MiniF2F and ProofNet test sets respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long-horizon multi-turn reasoning and complex search.

Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers

TL;DR

BFS-Prover-V2 addresses the dual scaling challenge of LLM-based theorem provers by pairing a training-time, AlphaZero-inspired multi-turn off-policy RL pipeline with an inference-time planner-guided, multi-agent search architecture. The RL component employs adaptive tactic filtering and periodic retraining to overcome long-horizon learning plateaus, while inference uses a high-level Planner to decompose theorems into subgoals solved by parallel Prover agents sharing a subgoal cache. The system achieves state-of-the-art results on miniF2F (95.08% with 95.49% on validation) and ProofNet (41.4%), demonstrating strong generalization and scalability. The combined RL and planning approach offers a generalizable framework for long-horizon reasoning tasks beyond formal mathematics.

Abstract

The integration of Large Language Models (LLMs) into automated theorem proving has shown immense promise, yet is fundamentally constrained by challenges in scaling up both training-time reinforcement learning (RL) and inference-time compute. This paper introduces \texttt{BFS-Prover-V2}, a system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi-turn off-policy RL framework for continually improving the performance of LLM step-prover at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi-stage expert iteration pipeline featuring adaptive tactic-level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long-term RL in LLM-based agents. The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high-level planner to iteratively decompose complex theorems into a sequence of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state-of-the-art results on established formal mathematics benchmarks. \texttt{BFS-Prover-V2} achieves 95.08\% and 41.4\% on the MiniF2F and ProofNet test sets respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long-horizon multi-turn reasoning and complex search.

Paper Structure

This paper contains 22 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Overview of the training-time scaling up architecture. The process begins with a current expert model. The system then evaluates the model's performance to check for a plateau, which determines the subsequent path. If performance is improving, the model enters an inner expert iteration loop that involves generating new proofs, applying Adaptive Tactic Filtering, and refining the model. Conversely, if performance has plateaued, the system triggers the outer retraining loop, which consists of Data Re-synthesis, Aggressive Data Curation, and retraining the model from a base checkpoint. Upon completion, this retraining loop yields a new, improved expert model, which then serves as the starting point for the next cycle of evaluation and iteration.
  • Figure 2: Tactic-Level Data Filtering Based on the Perplexity Distribution. This histogram shows the probability distribution of tactic perplexity (represented as negative log-probability) from a single round of expert iteration. We filter out the low- and high-perplexity tails, shown in red. The low-perplexity tail represents overly simple tactics the model is already confident in, while the high-perplexity tail often consists of noisy or unnecessarily complex tactics. By training only on the central part of the distribution (blue), we focus the model's learning on challenging yet meaningful examples, which prevents overfitting and encourages a smoother, more stable improvement in reasoning capabilities.
  • Figure 3: Sustained Performance Improvement through Expert Iteration and Periodic Retraining. This graph plots the prover's performance on miniF2F against the number of expert iteration rounds. Performance steadily increases (blue circles) but eventually begins to plateau as the model settles into a local optimum. To counteract this, we periodically conduct a "soft reset" (red squares). This involves using the current expert model to solve all past problems to generate a cleaner, more efficient dataset, which is then used to retrain the model from a base checkpoint. This procedure allows the model to break out of its local optimum and continue improving, as evidenced by the significant performance jumps following each retraining phase. A model scale-up to 32 billion parameters is also shown (green diamond).
  • Figure 4: Overview of the planner-enhanced multi-agent tree search architecture. The Planner agent decomposes the main theorem into a sequence of simpler subgoals, which are managed in a Shared Subgoal Cache and solved in parallel by multiple Prover agents. Successfully proven subgoals augment the main proof's context, while failures can trigger a Dynamic Replanning loop. The inset provides a toy example, demonstrating how proving intermediate lemmas ($\mathrm{h}_1$, $\mathrm{h}_2$, $\mathrm{h}_3$) facilitates the proof of the final goal.