Entropy-guided sequence weighting for efficient exploration in RL-based LLM fine-tuning

Abdullah Vanlioglu

TL;DR

The paper addresses the challenge of efficient exploration in RL-based fine-tuning of large language models (LLMs) for multi-step reasoning. It introduces Entropy-Guided Sequence Weighting (EGSW), which combines advantage with entropy through a temperature-scaled softmax to weight generated sequences, and supports both step-wise and trajectory-wise updates atop the GRPO framework. Raw weights are computed as $w_{i,t}^{\text{raw}} = \exp\left((A_{i,t} + \alpha H_{i,t})/P\right)$ and normalized across sequences as $w_{i,t} = w_{i,t}^{\text{raw}} / \sum_j w_{j,t}^{\text{raw}}$; the normalized weights then rescale the policy gradient. Here $\alpha$ scales the entropy term and $P$ is the softmax temperature. Empirical results on Qwen2.5-Math-7B-based models (Math-500, AIME, GPQA Diamond) show that GRPO+EGSW yields higher rewards and improved reasoning efficiency compared to GRPO, while highlighting sensitivity to hyperparameter choices and the potential for more compact, informative token generation.
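
As a minimal sketch of the weighting formula above, the following Python function computes normalized weights for one step $t$ across a group of sampled sequences. The function name `egsw_weights`, its argument names, and the default values for `alpha` and `temperature` are illustrative assumptions and are not taken from the paper; subtracting the maximum score before exponentiating is a standard numerical-stability step that leaves the normalized weights unchanged.

```python
import math


def egsw_weights(advantages, entropies, alpha=0.5, temperature=1.0):
    """Return softmax weights over sequences for a single step t.

    advantages[i] plays the role of A_{i,t}, entropies[i] of H_{i,t}.
    alpha and temperature (P in the formula) are hypothetical defaults.
    """
    if len(advantages) != len(entropies):
        raise ValueError("advantages and entropies must have the same length")
    if not advantages:
        raise ValueError("at least one sequence is required")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Score each sequence: (A + alpha * H) / P.
    scores = [(a + alpha * h) / temperature
              for a, h in zip(advantages, entropies)]

    # Shift by the maximum so exp() cannot overflow; the shift cancels
    # in the ratio, so the normalized weights are unaffected.
    top = max(scores)
    raw = [math.exp(s - top) for s in scores]

    total = sum(raw)
    return [r / total for r in raw]
```

With equal advantages, the sequence with higher entropy receives the larger weight whenever `alpha` is positive, and raising `temperature` pulls the weights toward uniform. How these weights enter the GRPO loss (step-wise or trajectory-wise) is not shown here.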

Abstract

We introduce Entropy-Guided Sequence Weighting (EGSW), a novel approach that enhances the exploration-exploitation tradeoff by dynamically assigning weights to generated outputs based on their advantage and entropy for Reinforcement Learning-based Large Language Model fine-tuning. EGSW integrates entropy regularization with advantage-based weighting to balance policy updates, enabling efficient exploration in high-dimensional state spaces. By employing temperature-scaled softmax weighting over sequences, EGSW prioritizes high-reward, high-uncertainty steps while maintaining training stability. Although originally developed to improve Group Relative Policy Optimization (GRPO) during large language model (LLM) fine-tuning, EGSW is generalizable to other reinforcement learning (RL) algorithms and can be implemented in both step-wise and trajectory-wise settings. Empirical evaluations demonstrate that EGSW enhances GRPO reasoning ability, yielding improvements in sample efficiency. Future work will explore the application of EGSW to advanced RL methodologies.

Paper Structure

This paper contains 7 sections, 10 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: Training reward of the methods based on Qwen2.5-Math-7B
  • Figure 2: Completion length of the methods based on Qwen2.5-Math-7B
  • Figure 3: Training reward of the methods based on Qwen2.5-Math-7B-Instruct
  • Figure 4: Completion length of the methods based on Qwen2.5-Math-7B-Instruct