EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework

Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Yuzhi Zhang, Yue Wang

TL;DR

This paper tackles the limited exploration and training instability of GRPO when applied to complex reasoning tasks in LLMs. It introduces EFRame, a three-component RL framework consisting of additional rollouts for deeper exploration, online filtering to stabilize gradients, and experience replay to reinforce rare but informative trajectories. On Geometry3K, EFRame achieves a 37.9% relative improvement over GRPO, and it improves consistently across math and multimodal benchmarks; the paper also analyzes entropy dynamics explicitly. The work further provides fine-grained sample categorization and practical entropy control, positioning EFRame as a robust approach to deeper reasoning in LLMs. The code is publicly released.

Abstract

Recent advances in reinforcement learning (RL) have significantly enhanced the reasoning capabilities of large language models (LLMs). Group Relative Policy Optimization (GRPO), a lightweight variant of Proximal Policy Optimization (PPO), improves efficiency but suffers from limited exploration and training instability, which restricts its effectiveness on complex reasoning tasks. To address these challenges, we introduce EFRame, an Exploration-Filter-Replay framework that augments GRPO across three dimensions: additional rollouts enable deeper and more targeted exploration, online filtering removes low-quality samples to stabilize gradients and accelerate training, and experience replay amplifies rare yet informative trajectories for stable convergence. This unified framework establishes a principled training cycle that balances exploration, efficiency, and stability. Experiments on diverse reasoning benchmarks demonstrate that EFRame achieves consistent gains, including a 37.9% relative improvement on Geometry3K over GRPO. EFRame further supports fine-grained sample categorization and precise entropy control, highlighting it as a robust solution for advancing deeper reasoning in LLMs. Our code is available at https://github.com/597358816/EFRame.
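
The exploration, filter, and replay stages named in the abstract can be pictured with a toy loop. The following is only a minimal sketch under stated assumptions, not the authors' implementation: rollouts are reduced to binary rewards, additional rollouts are triggered when no sample in the group succeeds, the filter keeps only successful additional samples, and the replay buffer stores those successes. The names `sample_rewards`, `efr_step`, `extra_size`, and `replay_size` are hypothetical.

```python
import random


def sample_rewards(n, success_prob, rng):
    """Hypothetical stand-in for n policy rollouts scored by a binary verifier."""
    return [1.0 if rng.random() < success_prob else 0.0 for _ in range(n)]


def efr_step(success_prob, replay_buffer, rng,
             group_size=8, extra_size=16, replay_size=4):
    """One toy exploration-filter-replay step for a single prompt (assumed design)."""
    # Regular GRPO-style rollout group.
    rewards = sample_rewards(group_size, success_prob, rng)

    # Replay: re-use a few stored rare successes from earlier steps.
    replayed = rng.sample(replay_buffer, min(replay_size, len(replay_buffer)))

    # Exploration: assumed trigger is a group in which every rollout failed.
    kept = []
    if max(rewards) == 0.0:
        extra = sample_rewards(extra_size, success_prob, rng)
        # Filter: assumed rule drops the failed additional rollouts.
        kept = [r for r in extra if r > 0.0]
        # Store the surviving successes for future replay.
        replay_buffer.extend(kept)

    # Batch handed to the policy update (the update itself is omitted).
    return rewards + kept + replayed


rng = random.Random(0)
buffer = []
for step in range(5):
    batch = efr_step(success_prob=0.05, replay_buffer=buffer, rng=rng)
    print(step, len(batch), len(buffer))
```

In this toy, a hard prompt (low `success_prob`) tends to trigger additional rollouts, and any success found there is both added to the current batch and kept for later steps, which is the sense in which rare trajectories are amplified.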

Paper Structure

This paper contains 17 sections, 2 theorems, 7 equations, 5 figures, 1 table.

Key Result

Theorem 3.4

For $o_h\in O_H$ and $o_l\in O_L$, $\sum \hat{A}_{h,0} + \sum\hat{A}_{l,0} = 0$.
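
A zero-sum identity of this kind follows naturally from group-normalized advantages, since standardizing rewards against the group mean makes them sum to zero over the whole group. The sketch below checks this numerically. It assumes that $O_H$ and $O_L$ partition one rollout group and, purely for illustration, splits them by the sign of the advantage; the paper's own definitions (Definitions 3.1 to 3.3) and the meaning of the subscript $0$ are not reproduced here. The helper names are hypothetical.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own rollout group (GRPO-style)."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def split_high_low(advantages):
    """Assumed split: positive advantage -> high set, otherwise low set."""
    high = [a for a in advantages if a > 0]
    low = [a for a in advantages if a <= 0]
    return high, low


# One group of 8 rollouts with binary rewards: 2 successes, 6 failures.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_advantages(rewards)
high, low = split_high_low(advantages)

# The two partial sums cancel up to floating-point error.
residual = sum(high) + sum(low)
print(f"high: {sum(high):.4f}  low: {sum(low):.4f}  total: {residual:.2e}")
```

The cancellation does not depend on how the group is split: any partition of a mean-centered group into two sets gives partial sums of equal magnitude and opposite sign.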

Figures (5)

  • Figure 1: Two prominent issues of GRPO: limited exploration and training instability, when training on Qwen2.5-VL-7B-Instruct with the Geometry3K dataset.
  • Figure 2: The overall workflow of EFRame builds upon GRPO by introducing three key components: additional rollout, online filter, and experience replay.
  • Figure 3: On the Geometry3K dataset, our method not only achieves excellent performance but also demonstrates superior exploration capability and training stability.
  • Figure 4: Peak accuracy (first 200 steps) and average entropy (first 150 steps) on Geometry3K with Qwen2.5-VL-7B-Instruct under varying $t_a$ and $R_s$. Higher $t_a$ boosts exploration (entropy↑), larger $R_s$ aids convergence (entropy↓), and the best performance arises when entropy is balanced between the two.
  • Figure 5: The ablation study conducted on the Geometry3K dataset not only validates the effectiveness of our proposed method but also reveals the distinct roles played by samples of varying quality during the training process.

Theorems & Definitions (9)

  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Theorem 3.4
  • Theorem 3.5: Entropy Change under NPG Update
  • Claim 1
  • Claim 2
  • Claim 3
  • Claim 4