Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration

Wenhao Deng, Long Wei, Chenglei Yu, Tailin Wu

TL;DR

The paper examines why RLVR with reverse KL regularization loses its advantage over the base model as the sampling budget grows, and attributes this to the regularizer's mode-seeking behavior, which confines exploration to the base model's support. It introduces RAPO, which replaces the reverse KL term with a forward KL term for out-of-distribution exploration and adds reward-aware reweighting of the reference policy for adaptive in-distribution exploration. In experiments with Qwen2.5-3B and 7B trained on the 8K SimpleRL-Zero dataset and evaluated on AIME2024 and AIME2025, RAPO improves performance consistently across sampling budgets, surpasses the base model's performance ceiling, and solves problems that were previously intractable. The results suggest that RAPO can advance RLVR on challenging reasoning tasks and may apply to other domains with verifiable rewards.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has recently enhanced the reasoning capabilities of large language models (LLMs), particularly for mathematical problem solving. However, a fundamental limitation remains: as the sampling budget increases, the advantage of RLVR-trained models over their pretrained bases often diminishes or even vanishes, revealing a strong dependence on the base model's restricted search space. We attribute this phenomenon to the widespread use of the reverse Kullback-Leibler (KL) divergence regularizer, whose mode-seeking behavior keeps the policy trapped inside the base model's support region and hampers wider exploration. To address this issue, we propose RAPO (Rewards-Aware Policy Optimization), an algorithm to promote broader yet focused exploration. Our method (i) utilizes the forward KL penalty to replace the reverse KL penalty for out-of-distribution exploration, and (ii) reweights the reference policy to facilitate adaptive in-distribution exploration. We train Qwen2.5-3B and 7B models with RAPO on the 8K SimpleRL-Zero dataset, without supervised fine-tuning, and evaluate them on AIME2024 and AIME2025. Results show that RAPO consistently improves problem-solving performance. Notably, RAPO enables models to surpass the base model's performance ceiling and solves previously intractable problems, advancing the frontier of RLVR for challenging reasoning tasks.
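The support argument in the abstract can be illustrated with a toy discrete example. The sketch below is not taken from the paper: the four-token vocabulary, the two distributions, and the helper `kl` are hypothetical, and it assumes the usual convention that reverse KL is $\mathrm{KL}(\pi_\theta \,\|\, \pi_\text{ref})$ and forward KL is $\mathrm{KL}(\pi_\text{ref} \,\|\, \pi_\theta)$. It shows that the reverse KL penalty becomes infinite as soon as the policy places mass on a token the reference never samples, while the forward KL penalty stays finite.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # the term 0 * log(0 / q_i) is taken to be 0
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical reference policy: token 3 lies outside its support.
pi_ref = [0.5, 0.3, 0.2, 0.0]
# Hypothetical candidate policy that moves some mass onto token 3.
pi_theta = [0.4, 0.2, 0.2, 0.2]

reverse_kl = kl(pi_theta, pi_ref)  # KL(pi_theta || pi_ref): infinite here
forward_kl = kl(pi_ref, pi_theta)  # KL(pi_ref || pi_theta): finite here

print(f"reverse KL = {reverse_kl}, forward KL = {forward_kl:.4f}")
```

Under these assumptions, any policy that leaves the reference support is ruled out by the reverse KL penalty, whereas the forward KL penalty only requires the policy to keep covering the reference support.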

Paper Structure

This paper contains 19 sections, 3 theorems, 29 equations, 3 figures, 1 table, 1 algorithm.

Key Result

Lemma 3.1

The optimal policy $\pi_\theta^{\star}$ for the reverse-KL-regularized problem in Eq. \ref{eq:origional_RKL} satisfies

Figures (3)

  • Figure 1: Approach motivation and illustration. The reference model $\pi_\text{ref}$ and the reward function are shared across all four subfigures. (a) High-reward regions with low or zero probability under the reference model remain underexplored. (b) RLVR with the proposed forward KL divergence facilitates out-of-distribution exploration, overcoming the limitations of reverse KL divergence. (c) The reward-aware reference policy reweighting mechanism enables adaptive in-distribution exploration. (d) RAPO, which integrates the reweighted reference policy with forward KL divergence optimization, boosts exploration effectiveness.
  • Figure 2: Illustration of the forward-KL-based optimization. The support of $\pi_\theta^{\star}$ extends beyond that of $\pi_\text{ref}$ (token IDs = 0, 1, 2, 13, 14, 15). The numerical solution obtained by gradient descent on Eq. \ref{eq:discrete_FKL_max_entropy} matches the numerical root of the equation in the theoretical result of Proposition 3.3.
  • Figure 3: Mathematical reasoning performance of RAPO, the base model, and GRPO-RKL with Qwen2.5-7B on the full AIME25 dataset (left) and the AIME24 Hard subset (right). Pass@$k$ is evaluated at $k = 2^m$ for $m \in \{0, 1, \ldots, 10\}$. The total number of samples is $n = 2048$.
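The Pass@$k$ curve in Figure 3 can be computed from $n$ samples per problem with the widely used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, where $c$ is the number of correct samples. The source does not state which estimator the paper uses, so the sketch below is an assumption; the function name `pass_at_k` and the value of `c` are hypothetical and chosen only for illustration.

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total samples drawn for one problem
    c: number of those samples that are correct
    k: sampling budget being evaluated (k <= n)
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    # Product form of C(n - c, k) / C(n, k), which avoids huge binomials.
    prob_all_fail = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_fail *= 1.0 - k / i
    return 1.0 - prob_all_fail


n = 2048                              # total samples, as in Figure 3
ks = [2 ** m for m in range(11)]      # k = 1, 2, 4, ..., 1024
c = 3                                 # hypothetical count of correct samples
curve = {k: pass_at_k(n, c, k) for k in ks}

for k, value in curve.items():
    print(f"pass@{k}: {value:.4f}")
```

Averaging this per-problem estimate over a benchmark gives one point of the curve for each $k$; a problem with even a few correct samples out of 2048 contributes substantially at large $k$, which is why large sampling budgets expose the base model's ceiling.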

Theorems & Definitions (6)

  • Lemma 3.1
  • Lemma 3.2
  • Proposition 3.3
  • Proof of Lemma 3.1
  • Proof of Lemma 3.2
  • Proof of Proposition 3.3