MO-GRPO: Mitigating Reward Hacking of Group Relative Policy Optimization on Multi-Objective Problems

Yuki Ichihara, Yuu Jinnai, Tetsuro Morimura, Mitsuki Sakamoto, Ryota Mitsuhashi, Eiji Uchibe

TL;DR

This work identifies reward hacking as a vulnerability of Group Relative Policy Optimization (GRPO) in multi-objective settings, where high-variance rewards can unduly dominate learning. It proposes MO-GRPO, a simple normalization that computes per-reward normalized advantages and aggregates them, ensuring all reward functions contribute evenly while preserving preference orderings under scale changes. The authors establish theoretical properties, including affine-invariance and stable per-reward influence, and demonstrate empirical gains across four domains: multi-armed bandits, simulated control, machine translation, and instruction following. MO-GRPO consistently mitigates reward hacking, improves task-specific metrics, and offers robust performance without manual tuning of reward scales, indicating strong practical value for multi-objective reinforcement learning with imperfect reward models.

Abstract

Group Relative Policy Optimization (GRPO) has been shown to be an effective algorithm when an accurate reward model is available. However, such a highly reliable reward model is not available in many real-world tasks. In this paper, we particularly focus on multi-objective settings, in which we identify that GRPO is vulnerable to reward hacking, optimizing only one of the objectives at the cost of the others. To address this issue, we propose MO-GRPO, an extension of GRPO with a simple normalization method to reweight the reward functions automatically according to the variances of their values. We first show analytically that MO-GRPO ensures that all reward functions contribute evenly to the loss function while preserving the order of preferences, eliminating the need for manual tuning of the reward functions' scales. Then, we evaluate MO-GRPO experimentally in four domains: (i) the multi-armed bandits problem, (ii) simulated control task (Mo-Gymnasium), (iii) machine translation tasks on the WMT benchmark (En-Ja, En-Zh), and (iv) instruction following task. MO-GRPO achieves stable learning by evenly distributing correlations among the components of rewards, outperforming GRPO, showing MO-GRPO to be a promising algorithm for multi-objective reinforcement learning problems.
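
The difference between the two normalizations can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that GRPO normalizes the summed reward within a group, that MO-GRPO sums group-wise z-scores of each reward, and the function names, the `eps` constant, and the sample values are all hypothetical.

```python
from statistics import fmean, pstdev

def zscore(values, eps=1e-8):
    """Normalize values within a group to zero mean and (roughly) unit std."""
    mu = fmean(values)
    sd = pstdev(values)
    return [(v - mu) / (sd + eps) for v in values]

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style (assumed form): sum the K rewards per sample, then normalize."""
    totals = [sum(per_sample) for per_sample in zip(*rewards)]
    return zscore(totals, eps)

def mo_grpo_advantages(rewards, eps=1e-8):
    """MO-GRPO-style (assumed form): normalize each reward, then sum."""
    normalized = [zscore(r, eps) for r in rewards]
    return [sum(per_sample) for per_sample in zip(*normalized)]

# Hypothetical group of G = 4 samples scored by K = 2 reward functions.
r1 = [0.0, 1.0, 2.0, 3.0]      # small-scale objective
r2 = [30.0, 10.0, 20.0, 0.0]   # conflicting objective on a 10x larger scale

grpo = grpo_advantages([r1, r2])
mo = mo_grpo_advantages([r1, r2])

print("GRPO   :", [round(a, 3) for a in grpo])
print("MO-GRPO:", [round(a, 3) for a in mo])
```

In this toy group the summed-reward advantage simply follows `r2`, so the sample that is worst on `r1` gets the highest advantage. The per-reward version instead favors the sample that does reasonably well on both objectives, and its output does not change if `r2` is rescaled.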

Paper Structure

This paper contains 30 sections, 6 theorems, 27 equations, 11 figures, 15 tables.

Key Result

Theorem 1

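Assume $G\rightarrow\infty$. The correlation coefficient between an individual reward function $R_i$ and the advantage $A_g$ is $\operatorname{Corr}(R_i, A_g) = \frac{\sigma_i^2 + X}{\sigma_i \sigma}$, where $\sigma_i$ is the standard deviation of $R_i$, $\sigma$ is the standard deviation of the total reward, $X =\sum_{j \neq i} \operatorname{Cov}(R_i, R_j)$, and $\operatorname{Cov}(\cdot,\cdot)$ is the covariance. When the rewards are uncorrelated ($X = 0$), this is the ratio of $R_i$'s standard deviation $\sigma_i$ to $\sigma$.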
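
The relationship can be checked numerically. The sketch below assumes the GRPO advantage is the group-normalized sum of the rewards, in which case $\operatorname{Corr}(R_i, A_g) = \operatorname{Cov}(R_i, \sum_j R_j)/(\sigma_i \sigma) = (\sigma_i^2 + X)/(\sigma_i \sigma)$. The Gaussian rewards, their scales, the seed, and the group size are hypothetical choices made only for illustration.

```python
import math
import random
from statistics import fmean

def cov(x, y):
    """Population covariance of two equally long sequences."""
    mx, my = fmean(x), fmean(y)
    return fmean([(a - mx) * (b - my) for a, b in zip(x, y)])

def corr(x, y):
    """Pearson correlation coefficient."""
    return cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))

random.seed(0)
G = 20000            # large group as a stand-in for G -> infinity
stds = [1.0, 5.0]    # hypothetical reward scales
rewards = [[random.gauss(0.0, s) for _ in range(G)] for s in stds]

# GRPO-style advantage (assumed form): normalize the summed reward.
total = [sum(per_sample) for per_sample in zip(*rewards)]
sigma = math.sqrt(cov(total, total))
mean_total = fmean(total)
advantage = [(t - mean_total) / sigma for t in total]

results = []
for i, r_i in enumerate(rewards):
    sigma_i = math.sqrt(cov(r_i, r_i))
    # X = sum over j != i of Cov(R_i, R_j); close to 0 for independent rewards.
    x_i = sum(cov(r_i, r_j) for j, r_j in enumerate(rewards) if j != i)
    predicted = (sigma_i ** 2 + x_i) / (sigma_i * sigma)
    empirical = corr(r_i, advantage)
    results.append((empirical, predicted, sigma_i / sigma))
    print(f"R_{i + 1}: empirical={empirical:.4f} "
          f"predicted={predicted:.4f} sigma_i/sigma={sigma_i / sigma:.4f}")
```

Because the two rewards are drawn independently, $X$ is close to zero and both correlations fall near $\sigma_i/\sigma$, so the reward with the larger standard deviation is almost perfectly correlated with the advantage.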

Figures (11)

  • Figure 1: (Simulated experiment) Comparison of the advantage values of GRPO and MO-GRPO on a toy example with two reward functions whose variances differ ($1$ and $5$). The advantage values of GRPO (left) are dominated by the high-variance reward ($R_2$), indicating that the algorithm is sensitive to the relative scales of the rewards. In contrast, the advantage values of MO-GRPO (right) are invariant to the scale of the reward models, so MO-GRPO is an easy-to-use algorithm for multi-objective learning tasks that does not require manual tuning of the reward models to avoid reward hacking.
  • Figure 2: (Multi-armed bandit) Average of the sum of the three reward functions obtained by GRPO, MO-GRPO, and Dr. GRPO. MO-GRPO finds a better policy faster than GRPO and Dr. GRPO.
  • Figure 3: (Multi-armed bandit) Comparison of the three reward functions with varying variances ($10$, $1$, and $0.1$) obtained by GRPO, Dr. GRPO, and MO-GRPO. While GRPO and Dr. GRPO fail or are slow to learn the reward functions with lower variances ($R_2$ and $R_3$), MO-GRPO successfully optimizes all three reward functions regardless of the scale of the variances.
  • Figure 4: The simulated control task used in the experiment. A two-joint arm with a 6-dimensional state vector ($\sin$ and $\cos$ of the joint angles, plus their angular velocities) selects among 9 discrete actions to reach four targets within a 50-step episode. Each reward function is defined as $R_i = 1 - 4\lVert p_{\text{arm}} - p_{\text{target},i} \rVert_{2}^{2}$. The optimal control in this environment is to keep swinging the arm at a constant speed.
  • Figure 5: (Machine translation) Training process of GRPO with Sarashina (sarashina2.2-3b-instruct-v0.1) on the WMT En-Ja dataset, using BLEURT and jReadability as the reward functions. GRPO overfits to jReadability at the expense of BLEURT performance, and the standard deviation of jReadability is always larger than that of BLEURT. As shown in Appendix \ref{appendix:variance_exp}, the same phenomenon is observed with other LLMs.
  • ...and 6 more figures
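
The per-target reward from the Figure 4 caption, $R_i = 1 - 4\lVert p_{\text{arm}} - p_{\text{target},i} \rVert_{2}^{2}$, can be written out as a small function. This is only a sketch: the function name, the 2-D coordinate convention, and the target layout are hypothetical, since the caption does not give the actual target positions.

```python
def reacher_rewards(p_arm, targets):
    """Return R_i = 1 - 4 * ||p_arm - p_target_i||_2^2 for each target i."""
    rewards = []
    for tx, ty in targets:
        sq_dist = (p_arm[0] - tx) ** 2 + (p_arm[1] - ty) ** 2
        rewards.append(1.0 - 4.0 * sq_dist)
    return rewards

# Hypothetical target layout (assumption; not taken from the paper).
targets = [(0.14, 0.14), (-0.14, 0.14), (-0.14, -0.14), (0.14, -0.14)]

# Fingertip placed exactly on the first target.
rewards = reacher_rewards((0.14, 0.14), targets)
print([round(r, 4) for r in rewards])
```

Each reward peaks at 1 when the fingertip sits on its own target and decreases quadratically with distance, so no single arm position maximizes all four rewards at once.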

Theorems & Definitions (7)

  • Theorem 1: Correlation between reward function and advantage function with GRPO
  • Theorem 2: Correlation between a reward function and advantage function with MO-GRPO
  • Corollary 1: Correlation between a reward function and advantage function with MO-GRPO under certain assumptions
  • Proposition 1: Affine Invariance of MO-GRPO Advantage
  • Proposition 2
  • Theorem 3: Correlation between each reward function and advantage function with Dr. GRPO
  • proof
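
The affine invariance named in Proposition 1 can be illustrated with a short derivation. It assumes the MO-GRPO advantage is a sum of group-wise z-scores, $A_g = \sum_i (R_i(o_g) - \mu_i)/\sigma_i$, with $\mu_i$ and $\sigma_i$ the group mean and standard deviation of $R_i$. This form and the symbols $a_i$, $b_i$, $o_g$ are assumptions made for illustration, not the paper's exact statement or proof.

```latex
\begin{aligned}
R_i' &= a_i R_i + b_i, \qquad a_i > 0 \\
\mu_i' &= a_i \mu_i + b_i, \qquad \sigma_i' = a_i \sigma_i \\
\frac{R_i'(o_g) - \mu_i'}{\sigma_i'}
  &= \frac{a_i R_i(o_g) + b_i - a_i \mu_i - b_i}{a_i \sigma_i}
   = \frac{R_i(o_g) - \mu_i}{\sigma_i}
\end{aligned}
```

Each normalized term is unchanged by a positive rescaling and shift of its reward, so under this assumed form the summed advantage is unchanged as well.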