EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget

Liang Chen, Xueting Han, Qizhou Wang, Bo Han, Jing Bai, Hinrich Schütze, Kam-Fai Wong

TL;DR

The paper tackles persistent entropy collapse in reinforcement learning with verifiable rewards (RLVR) when training large language models for reasoning. It introduces Exploration-Enhanced Policy Optimization (EEPO), a rollout-time intervention that splits rollouts into two stages with a lightweight unlearning step in between, which suppresses recently sampled dominant responses and promotes exploration of alternative reasoning modes. EEPO keeps the unlearning targeted, triggerable, and low-overhead through a complementary loss that focuses on high-probability tokens, entropy-based gating, and a single-step update of the rollout model. Empirically, EEPO improves substantially over GRPO across five math reasoning benchmarks and three model families, with average relative gains of 24.3% on Qwen2.5-3B, 33.0% on Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base, while maintaining comparable training efficiency, indicating a practical path to better generalization in RLVR systems.
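The summary names entropy-based gating and a complementary loss but does not give their formulas, so the following is only a minimal sketch under stated assumptions. It assumes the gate fires when the policy's Shannon entropy falls below a threshold, and it assumes the complementary loss has the form -log(1 - p), borrowed from unlikelihood-style training. The function names and the threshold value are hypothetical and are not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_unlearn(probs, threshold):
    """Hypothetical gate: trigger unlearning only when entropy is low."""
    return entropy(probs) < threshold


def complementary_loss(p):
    """Assumed complementary loss -log(1 - p) for a sampled token."""
    return -math.log(1.0 - p)


def complementary_grad(p):
    """d/dp of -log(1 - p); it grows as p approaches 1."""
    return 1.0 / (1.0 - p)


collapsed = [0.97, 0.01, 0.01, 0.01]   # one dominant mode
uniform = [0.25, 0.25, 0.25, 0.25]     # maximally spread out

print(should_unlearn(collapsed, threshold=0.5))  # gate is open
print(should_unlearn(uniform, threshold=0.5))    # gate is closed
print(complementary_grad(0.9), complementary_grad(0.1))
```

Under this assumed form, the gradient magnitude is about nine times larger for a token with probability 0.9 than for one with probability 0.1, which is one way a loss can concentrate its effect on high-probability tokens.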

Abstract

Balancing exploration and exploitation remains a central challenge in reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs). Current RLVR methods often overemphasize exploitation, leading to entropy collapse, diminished exploratory capacity, and ultimately limited performance gains. Although techniques that increase policy stochasticity can promote exploration, they frequently fail to escape dominant behavioral modes. This creates a self-reinforcing loop (repeatedly sampling and rewarding dominant modes) that further erodes exploration. We introduce Exploration-Enhanced Policy Optimization (EEPO), a framework that promotes exploration via two-stage rollouts with adaptive unlearning. In the first stage, the model generates half of the trajectories; it then undergoes a lightweight unlearning step to temporarily suppress these sampled responses, forcing the second stage to explore different regions of the output space. This sample-then-forget mechanism disrupts the self-reinforcing loop and promotes wider exploration during rollouts. Across five reasoning benchmarks, EEPO outperforms GRPO, achieving average relative gains of 24.3% on Qwen2.5-3B, 33.0% on Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base.
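The sample-then-forget mechanism can be illustrated on a toy policy. The sketch below is not the authors' implementation: it assumes the policy is a softmax over a handful of discrete "responses", and it assumes the unlearning step is a single gradient-descent step on -log(1 - p) for the sampled responses. All names (`unlearn_step`, `two_stage_rollout`, `lr`) are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs, n, rng):
    """Draw n response indices from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=n)


def unlearn_step(logits, sampled, lr):
    """One gradient-descent step on the mean of -log(1 - p_y) over sampled y.

    Returns new logits; the input list is left untouched.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for y in sampled:
        # d/dz_k of -log(1 - p_y) = p_y / (1 - p_y) * (1[k == y] - p_k)
        scale = probs[y] / max(1.0 - probs[y], 1e-8)
        for k in range(len(logits)):
            indicator = 1.0 if k == y else 0.0
            grad[k] += scale * (indicator - probs[k]) / len(sampled)
    return [z - lr * g for z, g in zip(logits, grad)]


def two_stage_rollout(logits, n_rollouts, lr, rng):
    """Stage 1: sample half. Unlearn on a temporary copy. Stage 2: sample the rest."""
    half = n_rollouts // 2
    stage1 = sample(softmax(logits), half, rng)
    temp_logits = unlearn_step(logits, stage1, lr)  # rollout copy only
    stage2 = sample(softmax(temp_logits), n_rollouts - half, rng)
    return stage1, stage2, temp_logits


policy_logits = [4.0, 1.0, 1.0, 0.5]  # response 0 is the dominant mode
rng = random.Random(0)
stage1, stage2, temp_logits = two_stage_rollout(policy_logits, 8, 2.0, rng)
print("stage 1:", stage1)
print("stage 2:", stage2)
print("dominant prob before:", softmax(policy_logits)[0])
print("dominant prob during stage 2:", softmax(temp_logits)[0])
```

Because the update is applied to a temporary copy, `policy_logits` is unchanged after the rollout, which mirrors the "temporarily suppress" wording in the abstract. How the suppressed rollout model is restored or synchronized in the actual method is not stated in this summary.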

Paper Structure

This paper contains 37 sections, 11 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Comparison of GRPO and EEPO rollout processes. GRPO samples all trajectories from a fixed rollout model, while EEPO introduces an unlearning step on the rollout model between two sampling stages to promote exploration of diverse modes.
  • Figure 2: GRPO training dynamics: rapid entropy collapse accompanies rising Testset and decline on AMC23.
  • Figure 3: Illustration of exploration challenges in GRPO. (a) Policy distribution showing imbalanced modes with a dominant peak. (b) Self-reinforcement effect where the dominant mode becomes increasingly concentrated through positive feedback. (c) Effect of adding randomness (e.g., entropy regularization) which flattens the distribution but maintains the relative dominance of modes.
  • Figure 4: Unlearning suppresses the dominant mode and enables exploration of alternative modes that would otherwise be hard to reach.
  • Figure 5: Impact of hyperparameter choices on baseline performance using Qwen2.5-3B. Each subplot shows the average accuracy across four math benchmarks as a function of (a) temperature, (b) entropy coefficient, (c) clip-higher ratio, and (d) number of rollouts. The orange dashed line represents EEPO with fixed hyperparameters.
  • ...and 3 more figures
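
The point made in the caption of Figure 3(c), that added randomness flattens the distribution without changing which mode dominates, can be checked numerically. The sketch below uses temperature scaling as a stand-in for randomness-increasing techniques in general (the caption's own example is entropy regularization, which is a different mechanism); the logits are made-up values chosen for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ranking(probs):
    """Indices of the modes, ordered from most to least probable."""
    return sorted(range(len(probs)), key=lambda i: -probs[i])


logits = [3.0, 1.0, 0.5]  # hypothetical modes; index 0 dominates
sharp = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)

print("T=1.0:", sharp, "ranking:", ranking(sharp))
print("T=2.0:", flat, "ranking:", ranking(flat))
```

Raising the temperature lowers the dominant mode's probability but leaves the ranking of modes intact, so the dominant mode is still the most likely one to be sampled and rewarded.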