Entropy-guided sequence weighting for efficient exploration in RL-based LLM fine-tuning
Abdullah Vanlioglu
TL;DR
The paper addresses the challenge of efficient exploration in RL-based fine-tuning of large language models (LLMs) for multi-step reasoning. It introduces Entropy-Guided Sequence Weighting (EGSW), which combines advantage with entropy through a temperature-scaled softmax to weight generated sequences, and supports both step-wise and trajectory-wise updates atop the GRPO framework. Raw weights are computed as $w_{i,t}^{\text{raw}} = \exp\left((A_{i,t} + \alpha H_{i,t})/P\right)$ and normalized across sequences as $w_{i,t} = w_{i,t}^{\text{raw}} / \sum_j w_{j,t}^{\text{raw}}$, where $A_{i,t}$ is the advantage, $H_{i,t}$ is the entropy, $\alpha$ scales the entropy term, and $P$ is the softmax temperature. The normalized weights then reweight the policy gradient. Empirical results on Qwen2.5-Math-7B-based models (Math-500, AIME, GPQA Diamond) show that GRPO+EGSW yields higher rewards and improved reasoning efficiency compared to GRPO, while highlighting sensitivity to the choice of $\alpha$ and $P$ and the potential for more compact, informative token generation.
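As a rough illustration of the weighting formula above, the following sketch computes normalized EGSW weights for a group of sequences at a single step. It is not the author's implementation: the function name `egsw_weights`, the pure-Python style, and the example numbers are assumptions made for illustration, and subtracting the maximum score before exponentiating is a standard numerical-stability step that leaves the normalized weights unchanged.

```python
import math
from typing import List


def egsw_weights(advantages: List[float], entropies: List[float],
                 alpha: float, temperature: float) -> List[float]:
    """Return w_i = softmax_i((A_i + alpha * H_i) / P) over a group of sequences.

    `temperature` plays the role of P in the formula. The function name and
    signature are hypothetical, chosen only for this sketch.
    """
    if len(advantages) != len(entropies):
        raise ValueError("advantages and entropies must have the same length")
    if not advantages:
        raise ValueError("at least one sequence is required")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Score each sequence by its advantage plus the scaled entropy bonus.
    scores = [(a + alpha * h) / temperature
              for a, h in zip(advantages, entropies)]

    # Subtract the max score before exponentiating to avoid overflow;
    # the shift cancels out in the normalization.
    max_score = max(scores)
    raw = [math.exp(s - max_score) for s in scores]

    total = sum(raw)
    return [r / total for r in raw]


# Made-up values for three sampled sequences at one step.
advantages = [1.0, 1.0, -0.5]
entropies = [0.2, 1.2, 1.2]
weights = egsw_weights(advantages, entropies, alpha=0.5, temperature=1.0)
```

In this toy example the first two sequences share the same advantage, so the second one receives the larger weight only because of its higher entropy. Setting `alpha=0` reduces the scheme to a plain softmax over advantages, and raising `temperature` flattens the weights toward uniform.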
Abstract
We introduce Entropy-Guided Sequence Weighting (EGSW), a novel approach that enhances the exploration-exploitation tradeoff by dynamically assigning weights to generated outputs based on their advantage and entropy for Reinforcement Learning-based Large Language Model fine-tuning. EGSW integrates entropy regularization with advantage-based weighting to balance policy updates, enabling efficient exploration in high-dimensional state spaces. By employing temperature-scaled softmax weighting over sequences, EGSW prioritizes high-reward, high-uncertainty steps while maintaining training stability. Although originally developed to improve Group Relative Policy Optimization (GRPO) during large language model (LLM) fine-tuning, EGSW is generalizable to other reinforcement learning (RL) algorithms and can be implemented in both step-wise and trajectory-wise settings. Empirical evaluations demonstrate that EGSW enhances the reasoning ability of GRPO and improves sample efficiency. Future work will explore the application of EGSW to advanced RL methodologies.
