Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models

Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, Guang Shi

TL;DR

This work targets the exploration-exploitation imbalance in reinforcement learning with verifiable rewards (RLVR) by introducing Pass@k Training, which uses Pass@k as the training reward to promote more diverse and broader reasoning. It improves efficiency through bootstrap sampling and provides an analytical derivation of the advantage function, enabling variance-free updates and stable training. The authors demonstrate that Pass@k Training enhances exploration without harming Pass@1 performance, improves generalization to in-domain and out-of-domain tasks, and can transfer its benefits to Pass@1 via sequential or combined training. They further connect Pass@k Training to implicit reward design and propose adaptive strategies guided by policy entropy, opening avenues for more flexible RLVR methods and improved large reasoning models.
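To make the Pass@1 versus Pass@k distinction concrete, the sketch below computes the standard unbiased Pass@k estimate from binary verifier outcomes. It assumes `n` sampled responses per problem, `c` of which the verifier accepts. The function name `pass_at_k` is hypothetical, and this is the estimator commonly used for evaluation, not necessarily the exact reward computation used in the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples, c of which are correct.

    Returns the probability that a uniformly random subset of k of the
    n samples contains at least one correct response.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the all-incorrect subsets; it is 0 when
    # fewer than k responses are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical example: 8 rollouts, 2 verified correct.
    print(pass_at_k(8, 2, 1))  # Pass@1
    print(pass_at_k(8, 2, 4))  # Pass@4
```

With 8 rollouts and 2 correct responses, Pass@1 is 0.25 while Pass@4 is 55/70, roughly 0.786. A Pass@k reward therefore still credits a prompt on which most individual samples fail, which is the intuition behind using it to encourage exploration.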

Abstract

Reinforcement learning with verifiable rewards (RLVR), which typically adopts Pass@1 as the reward, struggles to balance exploration and exploitation: policies come to prefer conservative actions and converge to a local optimum. Identifying an appropriate reward metric is therefore crucial. In prior work, Pass@k has been used for evaluation, but its connection to the exploration ability of LLMs in RLVR remains largely overlooked. To investigate this, we first use Pass@k as the reward to train the policy model (i.e., Pass@k Training) and observe an improvement in its exploration ability. Next, we derive an analytical solution for the advantage of Pass@k Training, leading to an efficient and effective process. Building on this, our analysis reveals that exploration and exploitation are not inherently conflicting objectives; rather, they can mutually enhance each other. Moreover, Pass@k Training with analytical derivation essentially amounts to directly designing the advantage function. Inspired by this, we conduct a preliminary exploration of advantage design for RLVR, showing promising results and highlighting a potential future direction.
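The idea that an analytical solution can replace group sampling can be illustrated with a small combinatorial check. The sketch below is not the paper's derivation. It assumes binary rewards (1 if the verifier accepts a response, 0 otherwise), assumes a group of `k` responses receives reward 1 when any member is correct, and uses hypothetical function names. It enumerates every size-`k` group of the rollouts and compares the resulting averages with closed-form counts. Advantage normalization is deliberately left out.

```python
from itertools import combinations
from math import comb


def enumerate_group_rewards(rewards, k):
    """Brute force over all size-k groups of the rollouts.

    Returns (mean group reward, per-response mean reward over the
    groups that contain that response).
    """
    n = len(rewards)
    totals = [0.0] * n
    counts = [0] * n
    group_sum = 0.0
    num_groups = 0
    for group in combinations(range(n), k):
        # Assumed group reward: 1 if any member is correct.
        r = max(rewards[i] for i in group)
        group_sum += r
        num_groups += 1
        for i in group:
            totals[i] += r
            counts[i] += 1
    per_response = [t / c for t, c in zip(totals, counts)]
    return group_sum / num_groups, per_response


def closed_form_group_rewards(rewards, k):
    """Same quantities from counting arguments, with no enumeration."""
    n = len(rewards)
    n_neg = sum(1 for r in rewards if r == 0)
    # A group scores 0 only if all k members are incorrect.
    mean_group = 1.0 - comb(n_neg, k) / comb(n, k)
    if n_neg > 0:
        # A group containing a given incorrect response scores 0 only
        # if its other k - 1 members are also incorrect.
        neg_mean = 1.0 - comb(n_neg - 1, k - 1) / comb(n - 1, k - 1)
    else:
        neg_mean = 1.0  # unused: there are no incorrect responses
    per_response = [1.0 if r == 1 else neg_mean for r in rewards]
    return mean_group, per_response


if __name__ == "__main__":
    rollouts = [1, 0, 0, 1, 0, 0, 0, 0]  # hypothetical verifier outcomes
    print(enumerate_group_rewards(rollouts, 4))
    print(closed_form_group_rewards(rollouts, 4))
```

Both functions return the same values, but the closed form needs only the number of incorrect rollouts rather than a pass over every group, which suggests why a derivation of this kind can be cheaper and free of group-sampling noise.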

Paper Structure

This paper contains 38 sections, 21 equations, 18 figures, 7 tables.

Figures (18)

  • Figure 1: Enigmata scores (validation set) of Pass@k Training on Qwen2.5-7B-Ins. Pass@k Training boosts the model's exploration ability, leading to continued improvement in subsequent training and surpassing native RLVR and powerful LLMs.
  • Figure 2: Overview and comparison of Pass@1 Training and Pass@k Training. The major difference between these training paradigms lies in the reward calculation and advantage estimation process. Full sampling, bootstrap sampling, and analytical derivation are three progressive enhancements of Pass@k Training. To better demonstrate the Pass@k Training pipeline, we present the pseudocode in Appendix \ref{app:code}.
  • Figure 3: Training progress of Pass@1 Training and Pass@k Training with Full Sampling on the baseline setting.
  • Figure 4: Training progress of Pass@1 Training and Pass@k Training with Bootstrap Sampling under various $N_\text{rollout}$.
  • Figure 5: Training progress of Pass@1 Training and Pass@k Training with Analytical Derivation and Bootstrap Sampling on the baseline setting.
  • ...and 13 more figures