It Takes Two: Your GRPO Is Secretly DPO

Yihong Wu, Liheng Ma, Lei Ding, Muzhi Li, Xinyu Wang, Kejia Chen, Zhan Su, Zhanguang Zhang, Chenyang Huang, Yingxue Zhang, Mark Coates, Jian-Yun Nie

TL;DR

The paper reframes Group Relative Policy Optimization (GRPO) as a contrastive objective and uncovers a link to Direct Preference Optimization (DPO). It then proposes 2-GRPO, a variant that uses two rollouts per prompt, and shows that it preserves unbiased gradient estimates and the quality of standard GRPO while delivering major efficiency gains. Theoretical analyses detail how 2-GRPO implicitly normalizes advantages and how gradient variance can be managed by increasing the number of prompts in a batch. Empirically, 2-GRPO performs on par with 16-GRPO on math-reasoning benchmarks while using 1/8 of the rollouts and reducing training time by over 70%, pointing to a practical path toward resource-efficient RL for LLM post-training.

Abstract

Group Relative Policy Optimization (GRPO) is a prominent reinforcement learning algorithm for post-training Large Language Models (LLMs). It is commonly believed that GRPO necessitates a large group size to ensure stable training via precise statistical estimation, which incurs substantial computational overhead. In this work, we challenge this assumption by reframing GRPO as a form of contrastive learning, which reveals a fundamental connection to Direct Preference Optimization (DPO). Motivated by DPO's empirical success, we investigate the minimal two-rollout case (2-GRPO), a configuration previously deemed infeasible. We provide a rigorous theoretical analysis to validate 2-GRPO and demonstrate empirically that it achieves performance on par with 16-GRPO, despite using only 1/8 of the rollouts and reducing training time by over 70%.
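
To make the two-rollout case concrete, the sketch below computes group-normalized advantages the way GRPO is commonly described (reward minus group mean, divided by group standard deviation). The function name `group_advantages`, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration and are not taken from the paper. Under these assumptions, a group of two rollouts with different rewards always yields advantages of approximately +1 and -1, and a tied group yields zeros.

```python
import numpy as np


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Illustrative helper (hypothetical name). Uses the population
    standard deviation (NumPy's default, ddof=0); `eps` is an assumed
    stabilizer that avoids division by zero when all rewards are equal.
    """
    rewards = np.asarray(rewards, dtype=float)
    centered = rewards - rewards.mean()
    return centered / (rewards.std() + eps)


# Two rollouts with different rewards: one is pushed up, the other down.
pair_mixed = group_advantages([1.0, 0.0])

# Two rollouts with equal rewards: the prompt contributes no signal.
pair_tied = group_advantages([1.0, 1.0])

# Rollout budget of 2-GRPO relative to 16-GRPO, per prompt.
rollout_ratio = 2 / 16
```

With binary rewards, the magnitude of the reward gap drops out for a pair, so only the ordering of the two rollouts matters. That is the sense in which a pair of rollouts behaves like a preferred/rejected pair.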

Paper Structure

This paper contains 32 sections, 5 theorems, 22 equations, 2 figures, 2 tables.

Key Result

Proposition 3.2

The GRPO objective is a contrastive loss.
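
As a rough illustration of why a contrastive reading is plausible, consider the two-rollout case. The derivation below is a simplified sketch and not the paper's proof: it assumes a sequence-level surrogate objective, the population standard deviation, rewards $r_1 \neq r_2$, and it omits clipping, importance ratios, and any KL term. The symbols $o^{+}$ and $o^{-}$ (the higher- and lower-reward rollouts for prompt $q$) are notation introduced here.

```latex
\begin{align*}
\mu &= \tfrac{1}{2}(r_1 + r_2), \qquad
\sigma = \tfrac{1}{2}\lvert r_1 - r_2 \rvert, \\
A_i &= \frac{r_i - \mu}{\sigma} \in \{+1, -1\}, \\
\tfrac{1}{2}\sum_{i=1}^{2} A_i \log \pi_\theta(o_i \mid q)
  &= \tfrac{1}{2}\bigl[\log \pi_\theta(o^{+} \mid q) - \log \pi_\theta(o^{-} \mid q)\bigr].
\end{align*}
```

Under these assumptions the surrogate raises the log-likelihood of the better rollout and lowers that of the worse one, which is a pairwise contrast. When $r_1 = r_2$, both advantages are zero and the pair contributes nothing.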

Figures (2)

  • Figure 1: Qwen-1.5B: Visualization of reward and evaluation scores on the MATH dataset.
  • Figure 2: Qwen-7B: Visualization of reward and evaluation scores on the MATH dataset.

Theorems & Definitions (12)

  • Definition 3.1: General contrastive loss
  • Proposition 3.2
  • Proof of Proposition 3.2
  • Proposition 3.3
  • Proof of Proposition 3.3
  • Proposition 4.1
  • Definition 4.2: Gradient Variance
  • Lemma 4.3
  • Proof of Lemma 4.3
  • Proposition 4.4
  • ...and 2 more