Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs
Mengqi Liao, Xiangyu Xi, Ruinian Chen, Jia Leng, Yangen Hu, Ke Zeng, Shuai Liu, Huaiyu Wan
TL;DR
The paper tackles inefficiencies in reinforcement learning for LLMs arising from uniform rollout budgets and limited exploration. It introduces dynamic rollout budget allocation to concentrate computation on harder questions and a temperature scheduler to keep policy entropy at a stable level, promoting exploration without destabilizing updates. By integrating these components with GRPO, the approach yields improved pass@k metrics on math benchmarks, notably on AIME 2024, while maintaining exploration across problems. The findings suggest that adaptive budgets and entropy-controlled exploration can overcome reward-sparsity issues and push beyond the limitations of standard GRPO with rule-based rewards. This has practical implications for more efficient and robust RL training of LLMs in reasoning tasks.
Abstract
Reasoning large language models (LLMs) excel in complex tasks, which has drawn significant attention to reinforcement learning (RL) for LLMs. However, existing approaches allocate an equal number of rollouts to all questions during the RL process, which is inefficient. This inefficiency stems from the fact that training on simple questions yields limited gains, whereas more rollouts are needed for challenging questions to sample correct answers. Furthermore, while RL improves response precision, it limits the model's exploration ability, potentially resulting in a performance cap below that of the base model prior to RL. To address these issues, we propose a mechanism for dynamically allocating rollout budgets based on the difficulty of the problems, enabling more efficient RL training. Additionally, we introduce an adaptive dynamic temperature adjustment strategy to maintain the entropy at a stable level, thereby encouraging sufficient exploration. This enables LLMs to improve response precision while preserving their exploratory ability to uncover potential correct pathways. The code and data are available at: https://github.com/LiaoMengqi/E3-RL4LLMs
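
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the exact allocation or scheduling rules, so the function names (`allocate_rollouts`, `update_temperature`), the use of an estimated pass rate as the difficulty signal, the proportional split, and the fixed step size are all assumptions made for illustration.

```python
from typing import List


def allocate_rollouts(pass_rates: List[float], total_budget: int,
                      min_rollouts: int = 2) -> List[int]:
    """Split a fixed rollout budget so that harder questions get more samples.

    Assumption: difficulty is approximated by 1 - pass_rate, where pass_rate
    is an estimate of how often the current policy answers correctly.
    """
    n = len(pass_rates)
    if total_budget < n * min_rollouts:
        raise ValueError("total_budget is too small for the minimum per question")

    difficulty = [1.0 - p for p in pass_rates]
    spare = total_budget - n * min_rollouts
    total_difficulty = sum(difficulty)

    # Proportional share of the spare budget (uniform if every question is solved).
    if total_difficulty == 0:
        shares = [spare / n] * n
    else:
        shares = [spare * d / total_difficulty for d in difficulty]

    budgets = [min_rollouts + int(s) for s in shares]

    # Hand out rollouts lost to rounding, largest fractional remainder first.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


def update_temperature(temperature: float, entropy: float, target_entropy: float,
                       step_size: float = 0.05, t_min: float = 0.5,
                       t_max: float = 1.5) -> float:
    """Nudge the sampling temperature toward a target policy entropy.

    Assumption: a simple fixed-step controller. Entropy below the target
    raises the temperature (more exploration); entropy above it lowers it.
    """
    if entropy < target_entropy:
        temperature += step_size
    elif entropy > target_entropy:
        temperature -= step_size
    return min(t_max, max(t_min, temperature))


if __name__ == "__main__":
    # Three questions: easy, medium, hard (hypothetical pass-rate estimates).
    print(allocate_rollouts([0.9, 0.5, 0.1], total_budget=24))
    print(update_temperature(1.0, entropy=0.2, target_entropy=0.5))
```

In this sketch the total number of rollouts per batch stays fixed, so the extra samples for hard questions come out of the budget that a uniform scheme would spend on easy ones. The temperature bounds `t_min` and `t_max` are hypothetical safeguards against runaway adjustment.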
