DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search

Fang Wu, Weihao Xuan, Heli Qi, Ximing Lu, Aaron Tu, Li Erran Li, Yejin Choi

TL;DR

DeepSearch rethinks RLVR by moving structured search from inference time to training time, addressing the exploration bottlenecks that cause performance plateaus. By embedding Monte Carlo Tree Search with a global frontier, entropy-guided selection, and adaptive replay buffers into training, it provides fine-grained credit assignment across reasoning steps via Tree-GRPO. The approach sets a new state of the art on mathematical reasoning benchmarks for 1.5B models (62.95% average accuracy) while using 5.7x fewer GPU hours than extended training approaches, suggesting that strategic exploration can outperform brute-force scaling. The work proposes scaling reasoning capabilities through algorithmic exploration rather than simply increasing training depth or compute.

Abstract

Although RLVR has become an essential component for developing advanced reasoning skills in LLMs, contemporary studies have documented training plateaus that emerge after thousands of optimization steps, with performance gains diminishing markedly despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practices, where models rely on limited rollouts that often miss critical reasoning paths and fail to provide systematic coverage of the solution space. We present DeepSearch, a framework that integrates Monte Carlo Tree Search directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which is what causes performance improvements to diminish over prolonged training. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves 62.95% average accuracy and establishes a new state of the art for 1.5B reasoning models, while using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.
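
To make contribution (1) concrete, the following is a minimal sketch of what selecting from a global frontier could look like: every unexpanded node in the tree sits in one priority queue, so the next node to expand may come from any branch or depth, not only from beneath the most recently expanded node. The abstract does not specify the scoring rule or data structures, so `Node`, `GlobalFrontier`, `expand`, and the numeric scores are all hypothetical names and values chosen for illustration, not the authors' implementation.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One partial reasoning trace in the search tree (hypothetical structure)."""
    text: str
    score: float  # assumed priority; higher means more promising
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


class GlobalFrontier:
    """Holds every unexpanded node of the tree in a single max-priority queue."""

    def __init__(self) -> None:
        self._heap: list = []
        self._counter = 0  # tie-breaker so Node objects are never compared

    def push(self, node: Node) -> None:
        # heapq is a min-heap, so negate the score to pop the highest first.
        heapq.heappush(self._heap, (-node.score, self._counter, node))
        self._counter += 1

    def pop_best(self) -> Node:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def expand(node: Node, proposals: List[Tuple[str, float]],
           frontier: GlobalFrontier) -> None:
    """Attach scored child steps to `node` and add them to the frontier."""
    for text, score in proposals:
        child = Node(text, score, parent=node)
        node.children.append(child)
        frontier.push(child)


# Illustrative scores only; in practice they would come from the policy
# or a value estimate, which the abstract does not describe.
root = Node("problem", 0.0)
frontier = GlobalFrontier()
expand(root, [("step A", 0.2), ("step B", 0.7)], frontier)

first = frontier.pop_best()   # "step B", the highest-scoring frontier node
expand(first, [("step B1", 0.1), ("step B2", 0.15)], frontier)

second = frontier.pop_best()  # "step A": a shallower node from another branch
print(first.text, "->", second.text)
```

In this toy run the second selection jumps back to "step A" because its score exceeds that of both children of "step B". A purely local, root-to-leaf descent would not revisit that branch in the same way, which is the behavior the sketch is meant to illustrate.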

Paper Structure

This paper contains 34 sections, 19 equations, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: DeepSearch Framework Overview.
  • Figure 2: Average performance (AIME 2024, AIME 2025, and AMC 2023) of DAPO and DeepSearch after 3K RLVR training. Markers denote evaluations, while dotted lines indicate linear trends.