BFS-PO: Best-First Search for Large Reasoning Models

Fiorenzo Parascandolo; Wenhui Tan; Enver Sangineto; Ruihua Song; Rita Cucchiara

TL;DR

This paper proposes BFS-PO, an RL algorithm that alleviates overthinking (verbose, computationally costly reasoning chains) in Large Reasoning Models (LRMs) through a Best-First Search exploration strategy, and shows that BFS-PO can simultaneously increase LRM accuracy and shorten its answers.

Abstract

Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown excellent performance in reasoning tasks using long reasoning chains. However, this has also led to a significant increase in computational cost and to verbose outputs, a phenomenon known as overthinking. The tendency to overthink is often exacerbated by Reinforcement Learning (RL) algorithms such as GRPO and DAPO. In this paper, we propose BFS-PO, an RL algorithm that alleviates this problem using a Best-First Search exploration strategy. Specifically, BFS-PO looks for the shortest correct answer using a backtracking mechanism based on maximum-entropy nodes. By generating progressively shorter responses during training, BFS-PO learns to produce concise reasoning chains. Across different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase LRM accuracy and shorten its answers.
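To make the idea of backtracking to a maximum-entropy node concrete, here is a minimal sketch, not the authors' implementation. It assumes that each node in a reasoning chain carries the model's next-token probability distribution and that the backtracking point is the step with the highest Shannon entropy; the abstract does not specify either detail. The function names `token_entropy` and `max_entropy_position` and the toy probabilities are hypothetical.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_entropy_position(step_probs: Sequence[Sequence[float]]) -> int:
    """Index of the generation step whose distribution is most uncertain.

    Assumption for illustration: this step is treated as the node from
    which a new branch would be sampled when backtracking.
    """
    entropies = [token_entropy(p) for p in step_probs]
    return max(range(len(entropies)), key=entropies.__getitem__)


# Toy chain of three steps over a 3-token vocabulary (made-up numbers).
chain = [
    [0.98, 0.01, 0.01],  # confident step
    [0.34, 0.33, 0.33],  # near-uniform step, the most uncertain one
    [0.70, 0.20, 0.10],
]
branch_at = max_entropy_position(chain)
```

In this toy chain the second step is close to uniform, so it is the one selected as the branching point. A real system would obtain these distributions from the LRM's logits rather than from hand-written lists.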

Paper Structure (17 sections, 18 equations, 3 figures, 6 tables, 2 algorithms)

Introduction
Related Work
Preliminaries
Method
Optimization
Discussion
Experiments
Experimental Setup
Ablation
Main Results
Conclusion
GRPO and DAPO
Auxiliary Algorithms
Experimental Details
Accuracy Efficiency Score
...and 2 more sections

Figures (3)

Figure 1: A schematic comparison between GRPO/DAPO (a) and BFS-PO (b-d). In (a), the sampling mechanism of GRPO/DAPO is represented as a simple tree with only one forking node (the root node, conditioned on the question $q$). ✓ and ✗ represent correct and incorrect answers, respectively. In (b) and (d), the current best solution is selected, while in (c) new branches are added to the tree. For simplicity, in this figure we use $G=4$.
Figure 2: A schematic representation of the sub-tree $S(w_n^b)$.
Figure 3: A qualitative example of the exploration process of BFS-PO using Qwen2.5-3B-Instruct during training on GSM8K.
