Table of Contents
Fetching ...

Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends

Chaorui Yao, Yanxi Chen, Yuchang Sun, Yushuo Chen, Wenhao Zhang, Xuchen Pan, Yaliang Li, Bolin Ding

TL;DR

This work presents a first-principles derivation for group-relative REINFORCE -- a REINFORCE variant that uses the within-group mean reward as the baseline for advantage calculation -- without assuming a specific training data distribution, showing that it admits a native off-policy interpretation.

Abstract

Off-policy reinforcement learning (RL) for large language models (LLMs) is attracting growing interest, driven by practical constraints in real-world applications, the complexity of LLM-RL infrastructure, and the need for further innovations of RL methodologies. While classic REINFORCE and its modern variants like Group Relative Policy Optimization (GRPO) are typically regarded as on-policy algorithms with limited tolerance of off-policyness, we present in this work a first-principles derivation for group-relative REINFORCE -- a REINFORCE variant that uses the within-group mean reward as the baseline for advantage calculation -- without assuming a specific training data distribution, showing that it admits a native off-policy interpretation. This perspective yields two general principles for adapting REINFORCE to truly off-policy settings: regularizing policy updates, and actively shaping the data distribution. Our analysis demystifies some myths about the roles of importance sampling and clipping in GRPO, unifies and reinterprets two recent algorithms -- Online Policy Mirror Descent and Asymmetric REINFORCE -- as regularized forms of the REINFORCE loss, and offers theoretical justification for seemingly heuristic data-weighting strategies. Our findings lead to actionable insights that are validated with extensive empirical studies, and open up new opportunities for principled algorithm design in off-policy RL for LLMs. Source code for this work is available at https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/rec_gsm8k.

Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends

TL;DR

This work presents a first-principles derivation for group-relative REINFORCE -- a REINFORCE variant that uses the within-group mean reward as the baseline for advantage calculation -- without assuming a specific training data distribution, showing that it admits a native off-policy interpretation.

Abstract

Off-policy reinforcement learning (RL) for large language models (LLMs) is attracting growing interest, driven by practical constraints in real-world applications, the complexity of LLM-RL infrastructure, and the need for further innovations of RL methodologies. While classic REINFORCE and its modern variants like Group Relative Policy Optimization (GRPO) are typically regarded as on-policy algorithms with limited tolerance of off-policyness, we present in this work a first-principles derivation for group-relative REINFORCE -- a REINFORCE variant that uses the within-group mean reward as the baseline for advantage calculation -- without assuming a specific training data distribution, showing that it admits a native off-policy interpretation. This perspective yields two general principles for adapting REINFORCE to truly off-policy settings: regularizing policy updates, and actively shaping the data distribution. Our analysis demystifies some myths about the roles of importance sampling and clipping in GRPO, unifies and reinterprets two recent algorithms -- Online Policy Mirror Descent and Asymmetric REINFORCE -- as regularized forms of the REINFORCE loss, and offers theoretical justification for seemingly heuristic data-weighting strategies. Our findings lead to actionable insights that are validated with extensive empirical studies, and open up new opportunities for principled algorithm design in off-policy RL for LLMs. Source code for this work is available at https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/rec_gsm8k.

Paper Structure

This paper contains 48 sections, 42 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: A visualization of our off-policy interpretation for group-relative REINFORCE. Here $\widehat{L}(\bm{\theta}; \pi_{\bm{\theta}_t}) = {\mathbb{E}}_{x \sim \widehat{D}} [ \widehat{L}(\bm{\theta}; x, \pi_{\bm{\theta}_t}) ]$, where $\widehat{D}$ is the sampling distribution for prompts and $\widehat{L}(\bm{\theta}; x, \pi_{\bm{\theta}_t})$ is the loss defined in Eq. \ref{['eq:pairwise_surrogate_loss']}.
  • Figure 2: Empirical results for REC algorithms on GSM8k with Qwen2.5-1.5B-Instruct. Training reward curves are smoothed with a running-average window of size 3. Numbers in the legend denote clipping parameters $\epsilon_{\textsf{low}}, \epsilon_{\textsf{high}}$. The "mixed" setting adopts sync_interval = 16 and sync_offset = 8.
  • Figure 3: Empirical results for REC on ToolACE with Llama-3.2-3B-Instruct. Training reward curves are smoothed with a running-average window of size 3. Details about REC-TwoSide and REC-Ring are provided in Appendix \ref{['sec:more_clipping_methods']}.
  • Figure 4: Empirical performance of RED algorithms on GSM8k with Qwen2.5-1.5B-Instruct, in both on-policy and off-policy settings. Training reward curves are smoothed with a running-average window of size 3. Implementation details about RED-Weight and RED-Drop are provided in Appendix \ref{['subsec:RED']}.
  • Figure 5: Empirical results on Guru-Math with Qwen2.5-7B-Instruct. Training reward curves are smoothed with a running-average window of size 3.
  • ...and 8 more figures

Theorems & Definitions (1)

  • Remark 1