EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework
Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Yuzhi Zhang, Yue Wang
TL;DR
This paper tackles the limited exploration and training instability of GRPO on complex reasoning tasks for LLMs. It introduces EFRame, an RL framework with three components: additional rollouts for deeper exploration, online filtering to stabilize gradients, and experience replay to reinforce rare but informative trajectories. Empirical results show a 37.9% relative improvement over GRPO on Geometry3K and consistent improvements across math and multimodal benchmarks, accompanied by an explicit analysis of entropy dynamics. The framework also supports fine-grained sample categorization and practical entropy control, and the code is publicly released.
Abstract
Recent advances in reinforcement learning (RL) have significantly enhanced the reasoning capabilities of large language models (LLMs). Group Relative Policy Optimization (GRPO), a lightweight variant of Proximal Policy Optimization (PPO), improves efficiency but suffers from limited exploration and training instability, which reduce its effectiveness on complex reasoning tasks. To address these challenges, we introduce EFRame, an Exploration-Filter-Replay framework that augments GRPO along three dimensions: additional rollouts enable deeper and more targeted exploration, online filtering removes low-quality samples to stabilize gradients and accelerate training, and experience replay amplifies rare yet informative trajectories for stable convergence. This unified framework establishes a principled training cycle that balances exploration, efficiency, and stability. Experiments on diverse reasoning benchmarks demonstrate that EFRame achieves consistent gains, including a 37.9% relative improvement over GRPO on Geometry3K. EFRame further supports fine-grained sample categorization and precise entropy control, making it a robust solution for advancing deeper reasoning in LLMs. Our code is available at https://github.com/597358816/EFRame.
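The three components described in the abstract can be illustrated with a minimal Python sketch. It is not the released implementation. The group-wise reward standardization is the usual GRPO advantage. Everything else is an assumption made for illustration: the function names (group_advantages, efr_step), the trigger for additional rollouts (no initial rollout earned a positive reward), the filter rule (skip a group whose rewards are all identical, since its advantages are all zero), and the replay rule (store positive-advantage samples found only after extra exploration).

```python
import statistics
from collections import deque


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def efr_step(prompt, sample_fn, reward_fn, replay, n_first=4, n_extra=8):
    """One illustrative exploration-filter-replay step for a single prompt.

    sample_fn(prompt) returns a response and reward_fn(prompt, response)
    returns a float; both are hypothetical stand-ins for the policy and the
    verifier. replay is a deque acting as the experience buffer.
    """
    # Initial rollouts.
    group = [sample_fn(prompt) for _ in range(n_first)]
    rewards = [reward_fn(prompt, y) for y in group]

    # Exploration (assumed trigger): if every rollout failed, sample more.
    explored = False
    if max(rewards) <= 0.0:
        explored = True
        extra = [sample_fn(prompt) for _ in range(n_extra)]
        group.extend(extra)
        rewards.extend(reward_fn(prompt, y) for y in extra)

    # Filter (assumed rule): identical rewards give all-zero advantages,
    # so the group carries no learning signal and is skipped.
    if max(rewards) == min(rewards):
        return []

    advantages = group_advantages(rewards)
    batch = [(prompt, y, a) for y, a in zip(group, advantages)]

    # Replay (assumed rule): successes found only after extra exploration
    # are rare, so they are stored for reuse in later updates.
    if explored:
        replay.extend(item for item in batch if item[2] > 0)
    return batch


# Bounded buffer; the size is an arbitrary choice for this sketch.
replay_buffer = deque(maxlen=1024)
```

In a training loop, the batch returned for each prompt would be combined with samples drawn from replay_buffer before the policy update. How the released code mixes the two is not specified in the abstract.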
