On Entropy Control in LLM-RL Algorithms

Han Shen

TL;DR

This paper argues that traditional entropy regularization offers limited benefits in LLM-RL because of the vast action space and sparse optimal outputs, which lead to entropy-induced bias and entropy collapse. It introduces AEnt, which clamps entropy to a smaller, auto-densified token subset and adaptively tunes the entropy coefficient, balancing exploration against optimization bias. Theoretical results explain why entropy helps only under certain conditions, and empirical results show that AEnt consistently improves performance across multiple math-reasoning benchmarks while maintaining stable entropy and producing shorter, more concise outputs. The work offers a practical strategy for harnessing the benefits of entropy in LLM-RL and suggests future directions: refining token-space clamping and analyzing the clamped-entropy mechanism theoretically.

Abstract

For RL algorithms, appropriate entropy control is crucial to their effectiveness. A commonly used method for controlling the policy entropy is entropy regularization, which is adopted in various popular RL algorithms including PPO, SAC and A3C. Although entropy regularization has conventionally proved effective in robotics and game RL, studies have found that it gives weak to no gains in LLM-RL training. In this work, we study the issues of the entropy bonus in the LLM-RL setting. Specifically, we first argue that conventional entropy regularization suffers from the LLM's extremely large response space and the sparsity of the optimal outputs. As a remedy, we propose AEnt, an entropy control method that utilizes a new clamped entropy bonus with an automatically adjusted coefficient. The clamped entropy is evaluated with the re-normalized policy defined on a certain smaller token space, which encourages exploration within a more compact response set. In addition, the algorithm automatically adjusts the entropy coefficient according to the clamped entropy value, effectively controlling the entropy-induced bias while leveraging the entropy's benefits. AEnt is tested on math-reasoning tasks under different base models and datasets, and it is observed that AEnt outperforms the baselines consistently across multiple benchmarks.
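
The two ingredients described in the abstract, an entropy evaluated on a re-normalized policy over a smaller token set and a coefficient adjusted from that entropy value, can be illustrated with a minimal sketch. The abstract does not say how the smaller token space is chosen or what update rule is used, so everything specific here is an assumption made for illustration: the top-k selection, the additive update, and the names `clamped_entropy`, `adjust_coefficient`, `low`, `high`, `delta` and `coef_max` are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def clamped_entropy(logits, k):
    """Entropy of the policy re-normalized over a smaller token set.

    ASSUMPTION: the smaller token set is the k highest-probability tokens.
    The source only says "a certain smaller token space".
    """
    probs = softmax(logits)
    top = sorted(probs, reverse=True)[:k]
    z = sum(top)
    return entropy([p / z for p in top])


def adjust_coefficient(coef, ent, low, high, delta=5e-4, coef_max=1e-2):
    """Move the entropy coefficient according to the clamped entropy value.

    ASSUMPTION: a simple additive rule. Raise the coefficient when the
    clamped entropy falls below `low`, lower it when the entropy exceeds
    `high`, and keep the result inside [0, coef_max].
    """
    if ent < low:
        coef += delta
    elif ent > high:
        coef -= delta
    return min(max(coef, 0.0), coef_max)


if __name__ == "__main__":
    logits = [2.0, 1.5, 0.1, -3.0, -3.0, -4.0]
    ent = clamped_entropy(logits, k=3)
    coef = adjust_coefficient(1e-3, ent, low=0.5, high=1.0)
    print(f"clamped entropy = {ent:.4f}, coefficient = {coef:.4f}")
```

Because the clamped entropy is computed over at most `k` tokens, it is bounded by `log(k)` rather than by the log of the full vocabulary size, which is one way to read the abstract's statement that exploration is encouraged within a more compact response set.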

Paper Structure

This paper contains 21 sections, 8 theorems, 63 equations, 6 figures, 2 tables, 1 algorithm.

Key Result

Proposition 1

Assume the policy is a softmax. We have:

Figures (6)

  • Figure 1: Test in a controlled MDP with a large action space of size $|{\mathcal{A}}|=10^5$ and increasingly sparse optimal actions.
  • Figure 2: GRPO with a constant entropy bonus coefficient.
  • Figure 3: Test score comparison (see Figure 4 for more training metrics).
  • Figure 4: Entropy and response length trend (see also Figure 3 for test score comparison).
  • Figure 5: AEnt with an adaptive entropy coefficient vs. with a constant coefficient. The scores in this test are similar. The adaptive coefficient better controls the response length and the policy entropy.
  • ...and 1 more figure

Theorems & Definitions (15)

  • Proposition 1: Bounds under no entropy control
  • Proposition 2: Bound for entropy-regularized methods
  • Lemma 1: Entropy gradient
  • proof
  • Lemma 2
  • proof
  • Lemma 3: Entropy regularized softmax policy gradient
  • proof
  • Lemma 4: Performance difference lemma
  • proof
  • ...and 5 more