BFS-PO: Best-First Search for Large Reasoning Models

Fiorenzo Parascandolo, Wenhui Tan, Enver Sangineto, Ruihua Song, Rita Cucchiara

TL;DR

This paper proposes BFS-PO, an RL algorithm that alleviates overthinking in Large Reasoning Models (LRMs) using a Best-First Search exploration strategy, and shows that BFS-PO can simultaneously increase LRM accuracy and shorten its answers.

Abstract

Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown excellent performance in reasoning tasks using long reasoning chains. However, this has also led to a significant increase in computational costs and to verbose output, a phenomenon known as overthinking. The tendency to overthink is often exacerbated by Reinforcement Learning (RL) algorithms such as GRPO/DAPO. In this paper, we propose BFS-PO, an RL algorithm that alleviates this problem using a Best-First Search exploration strategy. Specifically, BFS-PO looks for the shortest correct answer using a backtracking mechanism based on maximum-entropy nodes. By generating progressively shorter responses during training, BFS-PO learns to produce concise reasoning chains. Using different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase LRM accuracy and shorten its answers.
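The two ingredients named in the abstract, picking the shortest correct answer and backtracking to a maximum-entropy node, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`token_entropy`, `max_entropy_position`, `shortest_correct`) are hypothetical, the next-token distributions are assumed to be available as plain probability lists, and the way BFS-PO actually resamples branches and assigns rewards is not described in this section.

```python
import math
from typing import List, Optional, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_entropy_position(step_distributions: Sequence[Sequence[float]]) -> int:
    """Index of the generation step whose distribution has the highest entropy.

    Assumption: this step plays the role of the backtracking node from which
    new branches would be sampled. Ties resolve to the earliest step.
    """
    entropies = [token_entropy(d) for d in step_distributions]
    return max(range(len(entropies)), key=entropies.__getitem__)


def shortest_correct(responses: List[List[str]], correct: List[bool]) -> Optional[int]:
    """Index of the shortest correct response, or None if none is correct.

    Assumption: length is measured in tokens and correctness is given by an
    external verifier.
    """
    candidates = [i for i, ok in enumerate(correct) if ok]
    if not candidates:
        return None
    return min(candidates, key=lambda i: len(responses[i]))


if __name__ == "__main__":
    # Toy example: three sampled responses, two of them correct.
    responses = [["a", "b", "c"], ["a", "b"], ["a"]]
    correct = [True, True, False]
    best = shortest_correct(responses, correct)

    # Toy next-token distributions along the selected response.
    distributions = [[1.0, 0.0], [0.5, 0.5]]
    fork = max_entropy_position(distributions)
    print(f"best response: {best}, backtrack at step: {fork}")
```

In this toy setup the shortest response is excluded because it is incorrect, and the second step is chosen as the fork because its distribution is the most uncertain. A full training loop would repeat the selection and branching steps and feed the resulting samples to the policy update, which is outside the scope of this sketch.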

Paper Structure

This paper contains 17 sections, 18 equations, 3 figures, 6 tables, 2 algorithms.

Figures (3)

  • Figure 1: A schematic comparison between GRPO/DAPO (a) and BFS-PO (b-d). In (a), the sampling mechanism of GRPO/DAPO is represented as a simple tree with only one forking node (the root node, conditioned on the question $q$). ✓ and ✗ represent correct and incorrect answers, respectively. In (b) and (d), the current best solution is selected, while in (c) new branches are added to the tree. For simplicity, in this figure we use $G=4$.
  • Figure 2: A schematic representation of the sub-tree $S(w_n^b)$.
  • Figure 3: A qualitative example of the exploration process of BFS-PO using Qwen2.5-3B-Instruct during training on GSM8K.