RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization

Yihong Dong, Xue Jiang, Yongding Tao, Huanyu Liu, Kechi Zhang, Lili Mou, Rongyu Cao, Yingwei Ma, Jue Chen, Binhua Li, Zhi Jin, Fei Huang, Yongbin Li, Ge Li

TL;DR

RL-PLUS tackles the capability boundary collapse observed when LLM reasoning is enhanced with Reinforcement Learning with Verifiable Reward (RLVR), by integrating external data with internal reasoning through a hybrid-policy approach. It introduces two core innovations: Multiple Importance Sampling to stabilize off-policy data integration, and an Exploration-Based Advantage Function to incentivize learning from correct but hard-to-explore reasoning paths. The composite objective blends internal exploitation with externally guided exploration, and theoretical analyses demonstrate reduced bias and variance relative to standard off-policy corrections. Empirically, RL-PLUS achieves state-of-the-art performance on six math benchmarks, strong out-of-distribution (OOD) generalization, and consistent gains across diverse models, indicating effective expansion of the base model’s reasoning capabilities.

Abstract

Reinforcement Learning with Verifiable Reward (RLVR) has significantly advanced the complex reasoning abilities of Large Language Models (LLMs). However, it struggles to break through the inherent capability boundaries of the base LLM, because its essentially on-policy strategy is coupled with the LLM's immense action space and sparse rewards. Critically, RLVR can lead to capability boundary collapse, narrowing the LLM's problem-solving scope. To address this problem, we propose RL-PLUS, a novel hybrid-policy optimization approach for LLMs that synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. RL-PLUS integrates two core components: Multiple Importance Sampling, to address distributional mismatch from external data, and an Exploration-Based Advantage Function, to guide the model towards high-value, unexplored reasoning paths. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach. Compared with existing RLVR methods, RL-PLUS achieves 1) state-of-the-art performance on six math reasoning benchmarks; 2) superior performance on six out-of-distribution reasoning tasks; 3) consistent and significant gains across diverse model families, with average relative improvements up to 69.2%. Moreover, the analysis of Pass@k curves indicates that RL-PLUS effectively resolves the capability boundary collapse problem.

Paper Structure

This paper contains 41 sections, 9 theorems, 32 equations, 7 figures, 4 tables.

Key Result

Theorem 3.1

So long as there is at least one policy in the behavior pool $\{\pi_{\beta_k}\}$ (e.g., $\pi_{\beta_k^*}$) that is a good approximation of the target policy $\pi_\theta$ (i.e., $\pi_{\beta_k^*} \approx \pi_\theta$), the variance of the MIS estimator will be low. The estimator is insensitive to other
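
The variance claim in Theorem 3.1 can be illustrated with a small numerical sketch. This section does not give the exact form of the paper's estimator, so the code below assumes the common balance-heuristic form of Multiple Importance Sampling, where the weight is the target probability divided by the average of the behavior probabilities. The three toy distributions (`target`, `good`, `poor`) and the helper `weight_variance` are hypothetical and chosen only for illustration.

```python
def weight_variance(target, sampler):
    """Exact variance of w(x) = target(x) / sampler(x) when x ~ sampler.

    Assumes sampler(x) > 0 wherever target(x) > 0, so that E[w] = 1
    and Var[w] = sum_x target(x)^2 / sampler(x) - 1.
    """
    second_moment = sum(p * p / q for p, q in zip(target, sampler) if p > 0)
    return second_moment - 1.0


# Hypothetical toy distributions over four actions (not from the paper).
target = [0.70, 0.20, 0.05, 0.05]  # stands in for pi_theta
good = [0.65, 0.25, 0.05, 0.05]    # behavior policy close to the target
poor = [0.01, 0.01, 0.49, 0.49]    # behavior policy far from the target

# Assumed balance-heuristic MIS: sample from the pool uniformly and
# divide by the average behavior probability.
mixture = [(g + b) / 2 for g, b in zip(good, poor)]

var_poor = weight_variance(target, poor)    # standard IS, poor policy only
var_good = weight_variance(target, good)    # standard IS, good policy only
var_mis = weight_variance(target, mixture)  # MIS over the pool {good, poor}

print(f"standard IS (poor policy): {var_poor:.3f}")
print(f"standard IS (good policy): {var_good:.3f}")
print(f"MIS over both policies:    {var_mis:.3f}")
```

Under these assumptions, the mixture probability is at least half of the good policy's probability for every action, so the MIS weight variance is bounded by `2 * (1 + var_good) - 1` no matter how poorly the other policy matches the target. That is the sense in which one good behavior policy keeps the variance low in this sketch.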

Figures (7)

  • Figure 1: (a) Commonly used RLVR methods can cause capability boundary collapse in base LLMs. (b) RL-PLUS overcomes capability boundary collapse in RLVR, consistently showing larger pass@k than the base model.
  • Figure 2: Training dynamics of RL-PLUS and other baselines.
  • Figure 3: Pass@k curves of RL-PLUS compared with baselines across multiple benchmarks.
  • Figure 4: Training stability of RL-PLUS.
  • Figure 5: Detailed training dynamics of RL-PLUS and other baselines.
  • ...and 2 more figures

Theorems & Definitions (22)

  • Theorem 3.1: Variance Robustness of MIS
  • Theorem 3.2: Bayes-Optimal Policy Estimator
  • Definition A.3: Standard Importance Sampling (IS) Estimator
  • Definition A.4: Proxy IS Estimator
  • Lemma A.5: Bias of the IS Estimator with a Proxy
  • proof
  • Lemma A.6: Bias of the Standard IS Estimator from Support Mismatch
  • proof
  • Lemma A.7: Variance of the IS Ratio
  • proof
  • ...and 12 more