Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards

Charles Arnal, Gaëtan Narozniak, Vivien Cabannes, Yunhao Tang, Julia Kempe, Remi Munos

TL;DR

This work analyzes Asymmetric REINFORCE (AsymRE) for off-policy reinforcement learning in the context of aligning and fine-tuning language models. With the advantage defined as A = r − V, the authors show that a smaller baseline V emphasizes positive samples, while a larger baseline stresses negative ones, with theoretical guarantees when V < V^μ and a phase transition when V ≥ V^μ. They demonstrate that off-policy learning benefits from focusing on positives, and that an excessively large baseline can cause rapid policy collapse and loss of diversity, particularly in language-model applications. Empirical results in a controlled bandit setting and on large-language-model reasoning tasks indicate that a conservative, slightly negative δV stabilizes training and can outperform a GRPO baseline in off-policy regimes. The study suggests practical guidelines for data-efficient off-policy RL in LLM alignment and highlights avenues for extending the framework with importance sampling and regularization techniques.

Abstract

Reinforcement learning (RL) is increasingly used to align large language models (LLMs). Off-policy methods offer greater implementation simplicity and data efficiency than on-policy techniques, but often result in suboptimal performance. In this work, we study the intermediate range of algorithms between off-policy RL and supervised fine-tuning by analyzing a simple off-policy REINFORCE algorithm, where the advantage is defined as $A=r-V$, with $r$ a reward and $V$ some tunable baseline. Intuitively, lowering $V$ emphasizes high-reward samples, while raising it penalizes low-reward ones more heavily. We first provide a theoretical analysis of this off-policy REINFORCE algorithm, showing that when the baseline $V$ lower-bounds the expected reward, the algorithm enjoys a policy improvement guarantee. Our analysis reveals that while on-policy updates can safely leverage both positive and negative signals, off-policy updates benefit from focusing more on positive rewards than on negative ones. We validate our findings experimentally in a controlled stochastic bandit setting and through fine-tuning state-of-the-art LLMs on reasoning tasks.
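To make the role of the baseline concrete, the following is a minimal sketch and not the authors' implementation. It assumes the objective takes the form $\mathbb{E}_{y\sim\mu}[(r(y)-V)\log\pi(y)]$ with no importance weights, and that $\pi$ is a softmax over a small discrete action set. The function names `softmax` and `asymre_gradient` and the toy batch are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    p = np.exp(z)
    return p / p.sum()

def asymre_gradient(logits, actions, rewards, baseline):
    """Monte Carlo gradient of mean[(r - V) * log pi(y)] with respect to the logits.

    `actions` are assumed to be sampled from a behavior policy mu rather than
    from pi, and no importance weights are applied (an assumption of this sketch).
    """
    pi = softmax(logits)
    advantages = rewards - baseline              # A = r - V
    grad = np.zeros_like(logits, dtype=float)
    for y, adv in zip(actions, advantages):
        one_hot = np.zeros_like(grad)
        one_hot[y] = 1.0
        grad += adv * (one_hot - pi)             # d log pi(y) / d logits = e_y - pi
    return grad / len(actions)

# Hypothetical batch: action 0 earned reward 1, action 2 earned reward 0.
logits = np.zeros(3)
actions = np.array([0, 2])
rewards = np.array([1.0, 0.0])

g_low = asymre_gradient(logits, actions, rewards, baseline=0.0)   # low baseline
g_high = asymre_gradient(logits, actions, rewards, baseline=1.0)  # high baseline
```

With the low baseline, only the rewarded sample has a nonzero advantage, so the update raises the logit of action 0. With the high baseline, only the unrewarded sample contributes, so the update lowers the logit of action 2 instead.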

Paper Structure

This paper contains 40 sections, 4 theorems, 50 equations, 13 figures.

Key Result

Theorem 4.2

[Analysis of expected AsymRE for tabular softmax policies] Let $Y$ be a finite set, $\mu$ be some behavior policy whose support is $Y$, and consider a softmax policy representation $\pi(y)\stackrel{\text{def}}{=} e^{l(y)}/\sum_{y'}e^{l(y')}$ on $Y$, where the logits $\{l(y)\}_{y\in Y}$ are t[…] As a consequence, $\operatorname{supp}(\pi^*_{\mu, V_1}) \subseteq \operatorname{supp}(\pi^*_{\mu,\,\ldots})$ […]
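The tabular softmax setting can be explored numerically. The sketch below is an illustration under stated assumptions, not the theorem's statement or proof. It assumes the expected objective is $\sum_y \mu(y)(r(y)-V)\log\pi(y)$, whose gradient with respect to the logit $l(y)$ is $\mu(y)(r(y)-V) - \pi(y)(V^\mu - V)$ with $V^\mu = \sum_y \mu(y) r(y)$. The three-armed bandit, the step size, and the function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    p = np.exp(z)
    return p / p.sum()

def expected_asymre_gradient(logits, mu, rewards, baseline):
    """Exact gradient of sum_y mu(y) * (r(y) - V) * log pi(y) w.r.t. the logits."""
    pi = softmax(logits)
    v_mu = float(np.dot(mu, rewards))            # expected reward under mu
    return mu * (rewards - baseline) - pi * (v_mu - baseline)

def run_expected_asymre(mu, rewards, baseline, steps=2000, lr=0.5):
    """Plain gradient ascent on the logits, starting from pi = mu."""
    logits = np.log(mu)
    for _ in range(steps):
        logits = logits + lr * expected_asymre_gradient(logits, mu, rewards, baseline)
    return softmax(logits)

# Hypothetical three-armed bandit (values chosen for illustration only).
mu = np.array([0.5, 0.3, 0.2])
rewards = np.array([0.2, 0.5, 1.0])
v_mu = float(mu @ rewards)

# A baseline below every reward: all advantages are positive.
baseline = -1.0
pi_limit = run_expected_asymre(mu, rewards, baseline)

# Stationary point obtained by setting the gradient above to zero.
pi_star = mu * (rewards - baseline) / (v_mu - baseline)
```

When the baseline lies below every reward, setting the gradient to zero gives $\pi(y) = \mu(y)(r(y)-V)/(V^\mu - V)$, a reweighting of $\mu$ toward high-reward actions that keeps full support. Raising `baseline` toward `v_mu` in this sketch is one way to observe how the support of the limiting policy changes; the precise characterization is the subject of the theorem and is not reproduced here.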

Figures (13)

  • Figure 1: Expected rewards and supports of the policies in the bandits experiments
  • Figure 2: Performance of the current policies for 40 iterations of policy improvement with AsymRE. Each iteration is 500 steps. The curve corresponding to $V\approx V^\mu$ is in red.
  • Figure 3: Training dynamics of Llama 8B on the MATH dataset (results are averaged over $3$ seeds, and a moving average with a window of size $3$ is applied). The behavior policy is updated every $N=250$ training steps.
  • Figure 4: Training dynamics of Qwen 3B on the MATH dataset (results are averaged over $3$ seeds, and a moving average with a window of size $3$ is applied). The behavior policy is updated every $N=250$ training steps.
  • Figure 5: Left: Training dynamics of Llama 8B on the MATH dataset for two values of the baseline, $\delta V=0$ and $\delta V=-0.1$, and $7$ independent runs for each value. The behavior policy is updated every $N = 250$ steps. We observe a systematic collapse when $\delta V=0$. Right: Test accuracy of Llama 8B and Qwen 3B trained on the MATH dataset with GRPO and AsymRE (with $V = -0.1$). The behavior policy is updated every $N = 250$ steps. Asymmetric REINFORCE leads to faster convergence and better performance than GRPO.
  • ...and 8 more figures

Theorems & Definitions (7)

  • Definition 4.1: Expected AsymRE and AsymRE
  • Theorem 4.2
  • Theorem 4.3
  • Theorem A.1
  • proof
  • Theorem A.1
  • proof