Personalized Group Relative Policy Optimization for Heterogenous Preference Alignment

Jialu Wang, Heinrich Peters, Asad A. Butt, Navid Hashemi, Alireza Hashemi, Pouya M. Ghari, Joseph Hoover, James Rae, Morteza Dehghani

TL;DR

This paper introduces Personalized GRPO (P-GRPO), an alignment framework that decouples advantage estimation from immediate batch statistics. It converges faster and reaches higher rewards than standard GRPO, improving the recovery of, and alignment with, heterogeneous preference signals.

Abstract

Despite their sophisticated general-purpose capabilities, Large Language Models (LLMs) often fail to align with diverse individual preferences because standard post-training methods, like Reinforcement Learning from Human Feedback (RLHF), optimize for a single, global objective. Group Relative Policy Optimization (GRPO), a widely adopted on-policy reinforcement learning framework, inherits this limitation in personalized settings because its group-based normalization implicitly assumes that all samples are exchangeable. This assumption conflates distinct user reward distributions and systematically biases learning toward dominant preferences while suppressing minority signals. To address this, we introduce Personalized GRPO (P-GRPO), a novel alignment framework that decouples advantage estimation from immediate batch statistics. By normalizing advantages against preference-group-specific reward histories rather than the concurrent generation group, P-GRPO preserves the contrastive signal necessary for learning distinct preferences. We evaluate P-GRPO across diverse tasks and find that it consistently achieves faster convergence and higher rewards than standard GRPO, showing a stronger ability to recover and align with heterogeneous preference signals. Our results demonstrate that accounting for reward heterogeneity at the optimization level is essential for building models that faithfully align with diverse human preferences without sacrificing general capabilities.

Paper Structure

This paper contains 38 sections, 1 theorem, 9 equations, 5 figures, 4 tables, and 1 algorithm.

Key Result

Corollary 3.1

The cluster-level advantage $\tilde{A}_{i,t}^p$ defined in Eq. (eq:userwise-advantage) decomposes into a rescaled group advantage and a bias correction term:

$$\tilde{A}_{i,t}^p = \frac{\sigma_G}{\sigma_p}\,\hat{A}_{i,t} + \frac{\mu_G - \mu_p}{\sigma_p}$$

where $\hat{A}_{i,t}$ is the standard GRPO advantage normalized within the generation group, $\mu_G$ is the mean reward over the current generation group, $\mu_p$ is the historical mean reward for preference group $p$, and $\sigma_G$ and $\sigma_p$ are the corresponding standard deviations.
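
The decomposition can be checked with a few lines of algebra once both advantages are written as z-scores. The sketch below assumes the cluster-level advantage has the form $(r_i - \mu_p)/\sigma_p$ and the GRPO advantage has the form $(r_i - \mu_G)/\sigma_G$. The symbols $r_i$ (reward of sample $i$), $\sigma_G$, and $\sigma_p$ (group and historical standard deviations) are notation assumed here, and any numerical-stability constant is ignored, so the paper's exact statement may differ.

```latex
\begin{align*}
\tilde{A}_{i,t}^p
  &= \frac{r_i - \mu_p}{\sigma_p} \\
  &= \frac{(r_i - \mu_G) + (\mu_G - \mu_p)}{\sigma_p} \\
  &= \frac{\sigma_G}{\sigma_p}\,\hat{A}_{i,t} + \frac{\mu_G - \mu_p}{\sigma_p}
\end{align*}
```

Under these assumptions, the first term is the group advantage rescaled by $\sigma_G/\sigma_p$, and the second term is an offset that is positive when the current generation group scores above the cluster's historical mean and negative when it scores below.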

Figures (5)

  • Figure 1: Overview of Personalized Group Relative Policy Optimization (P-GRPO). (a) Latent Reward Distributions: Users in different preference clusters often exhibit distinct reward distributions. In this example, a majority group (Blue) has a high mean reward ($\mu \approx 0.8$), while a minority group (Orange) has a lower mean reward ($\mu \approx 0.3$). (b) GRPO: Standard GRPO normalizes rewards using a generation batch mean ($\mu_\text{Batch}$). (c) GDPO: GDPO (yao2025no) conditions the DPO loss on cluster membership, requiring pairwise preference data (chosen $y_w$ vs. rejected $y_l$) for each cluster. (d) P-GRPO (Ours): Our method normalizes rewards against preference-specific statistics ($\mu_{\text{cluster}}, \sigma_{\text{cluster}}$) maintained via Welford's online algorithm. By comparing each output against its own cluster-wise baseline (e.g., $\mu_b \approx 0.3$ for minority users), P-GRPO correctly assigns advantages ($\tilde{A} \approx 0$), ensuring equitable optimization across diverse user preferences.
  • Figure 2: Training reward curves comparing GRPO and P-GRPO on the MovieLens-1M next-item prediction task across three models: Gemma-2B (left), Qwen3-1.7B (center), and Qwen3-8B (right). P-GRPO consistently converges faster and achieves higher average rewards, demonstrating improved learning efficiency through preference-specific normalization.
  • Figure 3: Test accuracy of the Qwen3-8B model on the MovieLens-1M dataset. Models are trained with four candidates but evaluated with varying candidate set sizes to assess generalization. P-GRPO consistently outperforms GRPO across all settings.
  • Figure 4: Ablation study on the impact of clustering quality for P-GRPO training on the MovieLens-1M dataset with Qwen3-8B. Left: Effect of cluster granularity, comparing different numbers of user clusters. Finer-grained clustering achieves higher rewards than coarser clustering. Right: Effect of random cluster assignment versus K-Means clustering, demonstrating that meaningful cluster quality is essential for personalization gains.
  • Figure 5: LLM-as-judge win rates across user preference clusters. Using GPT-OSS-120B as the judge, we compare responses generated by P-GRPO versus GRPO based on semantic quality, coherence, and user preference alignment. P-GRPO achieves higher win rates across all clusters in both datasets, demonstrating superior personalized generation capabilities.
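
The Figure 1 caption mentions per-cluster statistics maintained via Welford's online algorithm. The following is a minimal sketch of that idea, not the authors' implementation: the names `RunningStats` and `ClusterBaselines`, the `eps` constant, the use of the population standard deviation, and the choice to return a zero advantage until a cluster has seen two rewards are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Welford's online estimate of the mean and variance of a reward stream."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the current mean

    def update(self, reward: float) -> None:
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)

    @property
    def std(self) -> float:
        # Population standard deviation (assumed choice); 0.0 before two samples.
        return math.sqrt(self.m2 / self.count) if self.count > 1 else 0.0


class ClusterBaselines:
    """Hypothetical helper: one reward history per preference cluster."""

    def __init__(self, eps: float = 1e-6) -> None:
        self.eps = eps  # assumed stabilizer to avoid division by zero
        self.stats = defaultdict(RunningStats)

    def update(self, cluster_id: str, reward: float) -> None:
        self.stats[cluster_id].update(reward)

    def advantage(self, cluster_id: str, reward: float) -> float:
        s = self.stats[cluster_id]
        if s.count < 2:
            # Assumed cold-start rule: no usable baseline yet, so no signal.
            return 0.0
        return (reward - s.mean) / (s.std + self.eps)


# Illustrative reward histories, loosely echoing the means in Figure 1.
baselines = ClusterBaselines()
for r in (0.7, 0.8, 0.9):
    baselines.update("majority", r)
for r in (0.2, 0.3, 0.4):
    baselines.update("minority", r)

# The same reward is judged against each cluster's own history.
adv_minority = baselines.advantage("minority", 0.3)
adv_majority = baselines.advantage("majority", 0.3)
print(f"minority: {adv_minority:.3f}, majority: {adv_majority:.3f}")
```

In this toy setup, a reward of 0.3 sits at the minority cluster's historical mean and so yields an advantage near zero, while the same reward is strongly negative relative to the majority cluster's history. Welford's update keeps only three numbers per cluster, so the baseline can be refreshed after every rollout without storing past rewards.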

Theorems & Definitions (2)

  • Corollary 3.1
  • Example 1.1: Linear Reward Model