What is the Alignment Objective of GRPO?

Milan Vojnovic, Se-Young Yun

TL;DR

GRPO reframes alignment by combining a group-based reward-preference model with a reference-policy divergence penalty, yielding a non-logarithmic aggregation of preferences that depends on group size and reward structure. The paper derives per-context stationary conditions via a fixed-point equation, obtaining closed-form solutions for binary questions at both small and large group sizes, and showing a clear link to reverse KL divergence in the penalty term. It connects GRPO to RLHF and NLHF by illustrating how, under extensions like direct KL penalties or shift-only normalisation, the aggregation can resemble logarithmic pooling or RLHF-like objectives. The results clarify how the regularisation constant $\beta$, the confidence margin $\gamma$, and the reference distribution $\pi_{\mathrm{ref}}$ jointly shape policy updates, with practical implications for parameter choices and alignment design.

Abstract

In this note, we examine the aggregation of preferences achieved by the Group Relative Policy Optimisation (GRPO) algorithm, a reinforcement learning method used to train advanced artificial intelligence models such as DeepSeek-R1-Zero and DeepSeekMath. The GRPO algorithm trains a policy using a reward preference model, which is computed by sampling a set of outputs for a given context, observing the corresponding rewards, and applying shift-and-scale normalisation to these reward values. Additionally, it incorporates a penalty function to discourage deviations from a reference policy. We present a framework that enables us to characterise the stationary policies of the GRPO algorithm. This analysis reveals that the aggregation of preferences differs fundamentally from standard logarithmic pooling, which is implemented by other approaches such as RLHF. The precise form of preference aggregation arises from the way the reward preference model is defined and from the penalty function, which we show to essentially correspond to the reverse Kullback-Leibler (KL) divergence between the aggregation policy and the reference policy. Interestingly, we demonstrate that for groups of size two, the reward preference model corresponds to pairwise comparison preferences, similar to those in other alignment methods based on pairwise comparison feedback. We provide explicit characterisations of the aggregate preference for binary questions, for groups of size two, and in the limit of large group size. This provides insights into the dependence of the aggregate preference on parameters such as the regularisation constant and the confidence margin of question answers. Finally, we discuss the aggregation of preferences obtained by modifying the GRPO algorithm to use direct KL divergence as the penalty or to use rewards without scale normalisation.
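
The shift-and-scale normalisation described in the abstract can be illustrated with a minimal sketch. The helper name `group_advantages`, the use of the population standard deviation as the scale, and the convention of returning zeros for a group with identical rewards are all assumptions made for illustration; they are not taken from the paper.

```python
from statistics import fmean, pstdev


def group_advantages(rewards):
    """Shift-and-scale normalise the rewards of one sampled group.

    Assumptions (not from the paper): the scale is the population
    standard deviation, and a zero-variance group maps to all zeros.
    """
    mean = fmean(rewards)    # shift: the group mean reward
    scale = pstdev(rewards)  # scale: population standard deviation
    if scale == 0.0:
        # All outputs received the same reward, so none is preferred.
        return [0.0 for _ in rewards]
    return [(r - mean) / scale for r in rewards]


if __name__ == "__main__":
    # Group of size two: only the ordering of the two rewards matters.
    print(group_advantages([0.2, 0.9]))
    # Larger group: normalised values have zero mean and unit spread.
    print(group_advantages([1.0, 0.0, 0.0, 0.0]))
```

Under these assumptions, a group of two outputs with distinct rewards always normalises to +1 for the higher-reward output and -1 for the other, whatever the size of the reward gap. This is one way to see why the abstract says that groups of size two reduce to pairwise comparison preferences.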

Paper Structure

This paper contains 28 sections, 79 equations, 3 figures.

Figures (3)

  • Figure 1: GRPO's preference aggregation for the case of binary questions with two answers, $a$ or $b$, and groups of size two: $\pi_\theta(a\mid q)$ versus $\pi_{\mathrm{ref}}(a\mid q)$ for the answer $a$ where $\mathcal{P}(a\succ b) > \mathcal{P}(b\succ a)$.
  • Figure 2: GRPO's preference aggregation for the case of binary questions with two answers, $a$ or $b$, in the limit of large group size: $\pi_\theta(a\mid q)$ versus $\pi_{\mathrm{ref}}(a\mid q)$ for the answer $a$ where $r(a\mid q) > r(b\mid q)$.
  • Figure 3: Preference aggregation according to GRPO's reward preference model and direct KL divergence penalty, for the case of binary questions with two possible answers, $a$ or $b$, and groups of size two: $\pi_\theta(a\mid q)$ versus $\pi_{\mathrm{ref}}(a\mid q)$ for the answer $a$ where $r(a\mid q) > r(b\mid q)$. A notable difference from GRPO's alignment results, shown in Figure 1, is the lack of a discontinuity at $\pi_{\mathrm{ref}}(a\mid q) = 0$.