Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards
Charles Arnal, Gaëtan Narozniak, Vivien Cabannes, Yunhao Tang, Julia Kempe, Remi Munos
TL;DR
This work analyzes Asymmetric REINFORCE (AsymRE), a simple off-policy REINFORCE algorithm, in the context of aligning and fine-tuning language models. With the advantage defined as $A = r - V$, a smaller baseline $V$ emphasizes positive (high-reward) samples, while a larger baseline penalizes negative (low-reward) samples more heavily. The authors prove a policy improvement guarantee when $V < V^\mu$, where $V^\mu$ is the expected reward of the behavior policy $\mu$, and identify a phase transition when $V \geq V^\mu$. They show that off-policy learning benefits from focusing on positive samples, and that an overly large baseline can cause rapid policy collapse and loss of diversity, particularly in language-model applications. Experiments in a controlled bandit setting and on LLM reasoning tasks indicate that a conservative baseline offset (a slightly negative $\delta V$, which keeps the baseline just below the expected reward) stabilizes training and can outperform GRPO in off-policy regimes. The study suggests practical guidelines for data-efficient off-policy RL in LLM alignment and points to extensions of the framework with importance sampling and regularization.
Abstract
Reinforcement learning (RL) is increasingly used to align large language models (LLMs). Off-policy methods offer greater implementation simplicity and data efficiency than on-policy techniques, but often result in suboptimal performance. In this work, we study the intermediate range of algorithms between off-policy RL and supervised fine-tuning by analyzing a simple off-policy REINFORCE algorithm, where the advantage is defined as $A=r-V$, with $r$ a reward and $V$ some tunable baseline. Intuitively, lowering $V$ emphasizes high-reward samples, while raising it penalizes low-reward ones more heavily. We first provide a theoretical analysis of this off-policy REINFORCE algorithm, showing that when the baseline $V$ lower-bounds the expected reward, the algorithm enjoys a policy improvement guarantee. Our analysis reveals that while on-policy updates can safely leverage both positive and negative signals, off-policy updates benefit from focusing more on positive rewards than on negative ones. We validate our findings experimentally in a controlled stochastic bandit setting and through fine-tuning state-of-the-art LLMs on reasoning tasks.
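As a rough illustration of the role of the baseline, the sketch below runs the expected off-policy REINFORCE update on a three-armed bandit. It is not the authors' implementation. It assumes a softmax policy over arms, the surrogate objective $J(\theta) = \mathbb{E}_{y \sim \mu}[(r(y) - V)\log \pi_\theta(y)]$ with the exact expectation instead of sampled rewards, and hypothetical names (`asymre_expected_gradient`, `train`) and values (rewards, learning rate, step count) chosen only for this example.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()


def asymre_expected_gradient(logits, mu, rewards, baseline):
    """Exact gradient of J(theta) = E_{y~mu}[(r(y) - V) * log pi_theta(y)].

    For a softmax policy, d/d theta_k of sum_y w_y * log pi(y) equals
    w_k - pi_k * sum_y w_y, with w_y = mu(y) * (r(y) - V).
    """
    pi = softmax(logits)
    weights = mu * (rewards - baseline)  # mu(y) * A(y), with A = r - V
    return weights - pi * weights.sum()


def train(mu, rewards, baseline, lr=0.5, steps=5000):
    """Gradient ascent on the logits, starting from a uniform policy."""
    logits = np.zeros_like(rewards, dtype=float)
    for _ in range(steps):
        logits = logits + lr * asymre_expected_gradient(logits, mu, rewards, baseline)
    return softmax(logits)


# Illustrative bandit (assumed values): three arms, uniform behavior policy mu.
rewards = np.array([0.0, 0.5, 1.0])
mu = np.full(3, 1.0 / 3.0)
v_mu = float(mu @ rewards)  # expected reward of the behavior policy

# Baseline below V^mu (and below every reward, so all advantages are positive).
pi = train(mu, rewards, baseline=v_mu - 1.0)
print("behavior policy reward:", v_mu)
print("learned policy:", pi)
print("learned policy reward:", float(pi @ rewards))
```

In this setting every advantage is positive, so the objective is a weighted log-likelihood whose maximizer is proportional to $\mu(y)(r(y) - V)$. The learned policy therefore shifts mass toward high-reward arms while keeping all arms in its support. Changing the `baseline` argument is one way to explore how the behavior changes as $V$ approaches $V^\mu$.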
