Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs

Mengqi Liao, Xiangyu Xi, Ruinian Chen, Jia Leng, Yangen Hu, Ke Zeng, Shuai Liu, Huaiyu Wan

TL;DR

The paper tackles inefficiencies in reinforcement learning for LLMs arising from uniform rollout budgets and limited exploration. It introduces dynamic rollout budget allocation to concentrate computation on harder questions and a temperature scheduler to keep policy entropy at a stable level, promoting exploration without destabilizing updates. By integrating these components with GRPO, the approach yields improved pass@k metrics on math benchmarks, notably on AIME 2024, while maintaining exploration across problems. The findings suggest that adaptive budgets and entropy-controlled exploration can overcome reward-sparsity issues and push beyond the limitations of standard GRPO with rule-based rewards. This has practical implications for more efficient and robust RL training of LLMs in reasoning tasks.

Abstract

Reasoning large language models (LLMs) excel in complex tasks, which has drawn significant attention to reinforcement learning (RL) for LLMs. However, existing approaches allocate an equal number of rollouts to all questions during the RL process, which is inefficient. This inefficiency stems from the fact that training on simple questions yields limited gains, whereas more rollouts are needed for challenging questions to sample correct answers. Furthermore, while RL improves response precision, it limits the model's exploration ability, potentially resulting in a performance cap below that of the base model prior to RL. To address these issues, we propose a mechanism for dynamically allocating rollout budgets based on the difficulty of the problems, enabling more efficient RL training. Additionally, we introduce an adaptive dynamic temperature adjustment strategy to maintain the entropy at a stable level, thereby encouraging sufficient exploration. This enables LLMs to improve response precision while preserving their exploratory ability to uncover potential correct pathways. The code and data are available at: https://github.com/LiaoMengqi/E3-RL4LLMs
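
To make the idea of difficulty-based rollout budgets concrete, here is a minimal sketch. It is not the allocation rule from the paper, which the abstract does not spell out. It assumes difficulty is estimated as one minus a question's recent pass rate and that a fixed total budget is split proportionally; the function name `allocate_rollouts` and the `min_rollouts` floor are hypothetical.

```python
def allocate_rollouts(pass_rates, total_budget, min_rollouts=2):
    """Split total_budget rollouts across questions, favoring low pass rates.

    Assumption (not from the paper): difficulty = 1 - recent pass rate.
    """
    n = len(pass_rates)
    if total_budget < n * min_rollouts:
        raise ValueError("total_budget is too small for the per-question minimum")

    # Small floor so that even fully solved questions keep a nonzero weight.
    weights = [1.0 - p + 1e-3 for p in pass_rates]
    total_weight = sum(weights)

    # Every question gets the minimum; the spare budget is shared by weight.
    spare = total_budget - n * min_rollouts
    shares = [spare * w / total_weight for w in weights]
    budgets = [min_rollouts + int(s) for s in shares]

    # Give the rollouts lost to rounding to the largest fractional parts.
    leftover = max(total_budget - sum(budgets), 0)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                         reverse=True)
    for i in by_fraction[:leftover]:
        budgets[i] += 1
    return budgets


# Three questions: easy, medium, and never solved so far.
budgets = allocate_rollouts([0.9, 0.5, 0.0], total_budget=24)
```

Under this assumed rule the hardest question receives the largest share, while the easy one stays near the floor instead of consuming a uniform third of the budget.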

Paper Structure

This paper contains 15 sections, 27 equations, 11 figures, 3 tables, 1 algorithm.

Figures (11)

  • Figure 1: Smoothed entropy variations during training under different configurations. The curves represent the mean values, while the shaded regions denote the standard deviation across multiple runs. Here, ER represents entropy regularization, TS refers to the temperature scheduler, and AN indicates annealing. The red vertical line indicates the step at which annealing begins.
  • Figure 2: Left: the scaling factor of $H_{t}$ after temperature adjustment as a function of $\alpha$. When the entropy is relatively small (the entropy of the next-token distribution is typically on the order of $10^{-1}$), the scaling factor is approximately linear in $\alpha$. Right: $\tau_{t+1}$ as a function of $\alpha$ when $\tau_t = 1$.
  • Figure 3: The accuracy on the validation set during training, with the shaded area representing the variance across multiple runs. The red vertical line indicates the step at which annealing begins.
  • Figure 4: The temperature variation during training is presented for cases utilizing only the temperature scheduler and for those combining the scheduler with annealing.
  • Figure 5: Pass@k on AIME 2024
  • ...and 6 more figures
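
The captions for Figures 1, 2, and 4 rely on the fact that sampling temperature controls the entropy of the next-token distribution. The sketch below illustrates only that relationship: it computes the entropy of softmax(logits / tau) and bisects on tau to reach a target entropy. It is not the paper's temperature scheduler, whose update rule in terms of $\alpha$ is not given in this excerpt; the function names and the example logits are hypothetical.

```python
import math


def softmax_entropy(logits, tau):
    """Shannon entropy (in nats) of softmax(logits / tau)."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def temperature_for_entropy(logits, target, lo=0.05, hi=20.0, iters=60):
    """Bisect on tau until the entropy of softmax(logits / tau) hits target.

    Works because, for non-constant logits, entropy increases with tau.
    The target must lie between the entropies at lo and hi.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if softmax_entropy(logits, mid) < target:
            lo = mid  # too peaked: raise the temperature
        else:
            hi = mid  # too flat: lower the temperature
    return 0.5 * (lo + hi)


# Hypothetical next-token logits over a four-token vocabulary.
logits = [4.0, 1.0, 0.5, 0.0]
tau = temperature_for_entropy(logits, target=0.8)
```

For these logits the entropy at tau = 1 is below the 0.8 target, so the search settles on a temperature above 1, mirroring the intuition that a collapsing entropy is countered by sampling at a higher temperature.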