Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities

Pengyi Li, Elizaveta Goncharova, Andrey Kuznetsov, Ivan Oseledets

TL;DR

The paper addresses entropy collapse in reinforcement learning with verifiable rewards (RLVR) for large language model (LLM) reasoning. It introduces ProGRPO, a probabilistic extension of GRPO that employs an Advantage Re-weighting Mechanism (ARM) and Low-Probability Token Length Normalization to balance confidence across correct reasoning paths and emphasize informative, uncertain decision points. Empirical results on math and code benchmarks using Qwen2.5 and DeepSeek models show that ProGRPO yields stronger Pass@1 and Pass@32 performance and greater reasoning diversity, with robust generalization to out-of-distribution data. The approach provides a principled mitigation of mode collapse and a more effective exploration–exploitation balance, advancing RLVR for reliable, diverse LLM reasoning.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard policy optimization methods, such as Group Relative Policy Optimization (GRPO), often converge to low-entropy policies, leading to severe mode collapse and limited output diversity. We analyze this issue from the perspective of sampling probability dynamics, identifying that the standard objective disproportionately reinforces the highest-likelihood paths, thereby suppressing valid alternative reasoning chains. To address this, we propose a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate the confidence levels across all correct responses. By incorporating Prompt Perplexity and Answer Confidence into the advantage estimation, our method dynamically reshapes the reward signal to attenuate the gradient updates of over-confident reasoning paths, while redistributing probability mass toward under-explored correct solutions. Empirical results demonstrate that our approach significantly enhances generative diversity and response entropy while maintaining competitive accuracy, effectively achieving a superior trade-off between exploration and exploitation in reasoning tasks. Empirical results on Qwen2.5 and DeepSeek models across mathematical and coding benchmarks show that ProGRPO significantly mitigates entropy collapse. Specifically, on Qwen2.5-7B, our method outperforms GRPO by 5.7% in Pass@1 and, notably, by 13.9% in Pass@32, highlighting its superior capability in generating diverse correct reasoning paths.
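
The Pass@1 and Pass@32 figures quoted above depend on how pass@k is estimated from a finite number of rollouts. The text here does not say which estimator the paper uses; the sketch below assumes the widely used unbiased combinatorial estimator, where `n` rollouts are drawn per problem and `c` of them are correct. The function name `pass_at_k` is illustrative, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled responses, c of which are correct.

    Assumed estimator (not confirmed by the source):
        pass@k = 1 - C(n - c, k) / C(n, k)
    i.e. one minus the probability that k responses drawn without
    replacement from the n samples are all incorrect.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 32 rollouts for one problem, 4 of them correct.
p1 = pass_at_k(n=32, c=4, k=1)    # equals the plain accuracy c / n
p32 = pass_at_k(n=32, c=4, k=32)  # at least one correct among all 32
```

Under this estimator, Pass@1 reduces to average accuracy, while Pass@32 rewards a policy for reaching a correct answer on any rollout, which is why it is sensitive to output diversity.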

Paper Structure

This paper contains 24 sections, 5 theorems, 21 equations, 8 figures, 7 tables, 1 algorithm.

Key Result

Lemma 1.1

In standard GRPO, for any two distinct correct responses $o_i, o_j \in \mathcal{O}^+$, the advantage values are identical.
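
A small numerical sketch of this homogeneity property, under stated assumptions: rewards are binary (1 for a verified-correct response, 0 otherwise), and the advantage is the group-normalized reward $(r_i - \mathrm{mean}(r)) / (\mathrm{std}(r) + \epsilon)$, as in the commonly cited GRPO formulation. The helper name `grpo_advantages`, the `eps` value, and the example rewards are hypothetical.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r_i - mean) / (std + eps).

    Assumption: population standard deviation over the group; some
    implementations use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, six sampled responses; 1.0 marks a verified-correct answer.
rewards = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(rewards)

# Advantages of the correct responses only.
correct_advantages = [a for a, r in zip(advantages, rewards) if r == 1.0]

# Every correct response receives the same advantage, regardless of how
# likely the policy considered that response to be.
all_equal = len(set(correct_advantages)) == 1
```

Because the advantage depends only on the reward, a correct response the policy already favors and a rare correct response are reinforced with the same weight. This is the behavior that the re-weighting mechanism described in the abstract is meant to change.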

Figures (8)

  • Figure 1: Pass@k comparison of FlowRL, GRPO, and our method on the AIME 2024, AIME 2025, and AMC 23 benchmarks using Qwen2.5-7B.
  • Figure 2: Training entropy across optimization steps for different methods. Higher entropy indicates increased exploration during policy optimization.
  • Figure 3: Comparison of model performance across three metrics (average probability, lower 20% probability, and entropy), with statistics computed over 32 rollouts per sample on the AIME 2024 dataset.
  • Figure 4: Comparison of rollout token-level entropy on AIME 2024 between our method and the GRPO baseline.
  • Figure 5: Ablation study of average pass@k performance under different advantage formulations.
  • ...and 3 more figures

Theorems & Definitions (10)

  • Lemma 1.1: Homogeneity of Advantage
  • proof
  • Theorem 1.2
  • proof
  • Theorem 1.3
  • proof
  • Theorem 1.4
  • proof
  • Theorem 1.5
  • proof