Geometric-Mean Policy Optimization

Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, Fang Wan, Furu Wei

TL;DR

GMPO introduces a geometric-mean objective to stabilize policy optimization in large-language-model RL, addressing instability from outlier token rewards in GRPO. By maximizing the geometric mean and employing token-level clipping with a broader range, GMPO achieves more stable importance weights, higher entropy, and robust updates. The authors provide theoretical justification and empirical validation across five mathematical reasoning benchmarks and a multimodal geometry task, showing consistent improvements over GRPO. The work demonstrates a practical, plug-and-play improvement for post-training RL methods in LLMs and points to future directions in stable, scalable RL for reasoning tasks.

Abstract

Group Relative Policy Optimization (GRPO) has significantly enhanced the reasoning capability of large language models by optimizing the arithmetic mean of token-level rewards. Unfortunately, GRPO is observed to suffer from unstable policy updates when it encounters tokens with outlier importance-weighted rewards, which manifest as extreme importance sampling ratios during training. In this study, we propose Geometric-Mean Policy Optimization (GMPO), which aims to improve the stability of GRPO by suppressing token reward outliers. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and maintains a more stable range of importance sampling ratios. GMPO is plug-and-play: it simply replaces GRPO's arithmetic mean with the geometric mean of token-level rewards. GMPO is also theoretically plausible: analysis reveals that both GMPO and GRPO are weighted forms of the policy gradient, while the former enjoys more stable weights, which consequently benefits policy optimization and performance. Experiments on multiple mathematical reasoning benchmarks show that GMPO-7B improves the average Pass@1 of GRPO by up to 4.1%, outperforming many state-of-the-art approaches. Code is available at https://github.com/callsys/GMPO.
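
The claim that the geometric mean is less sensitive to outliers than the arithmetic mean can be illustrated with a small numerical sketch. The values below are hypothetical token-level importance ratios chosen for illustration, not data from the paper, and the helper functions are not taken from the GMPO codebase; the sketch only compares the two means on positive numbers and does not reproduce the full GMPO objective.

```python
import math

def arithmetic_mean(xs):
    """Plain average of the values."""
    return sum(xs) / len(xs)

def geometric_mean(xs):
    """Geometric mean, computed in log space; all values must be positive."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical per-token ratios that stay close to 1 (assumed values).
ratios = [1.0, 0.9, 1.1, 1.0, 1.0, 1.05, 0.95, 1.0]
# The same sequence with the last token replaced by an extreme outlier.
with_outlier = ratios[:-1] + [20.0]

# How far each mean moves when the single outlier is introduced.
arith_shift = arithmetic_mean(with_outlier) - arithmetic_mean(ratios)
geo_shift = geometric_mean(with_outlier) - geometric_mean(ratios)

print(f"arithmetic mean shift: {arith_shift:.3f}")
print(f"geometric mean shift:  {geo_shift:.3f}")
```

In this toy setting a single extreme token moves the arithmetic mean far more than the geometric mean, which is the intuition behind swapping one for the other.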

Paper Structure

This paper contains 13 sections, 9 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 2: Compared to the arithmetic mean, the geometric mean is more robust to outliers and yields importance sampling ratio distributions with lower variance.
  • Figure 3: The range of the importance sampling ratio $\rho_t(\theta)$ for different clipping ranges over training steps. A wider range indicates less stable policy updates. Compared to GRPO with a clipping range of (0.8, 1.2), GMPO demonstrates greater stability, even with a larger clipping range of ($e^{-0.4}, e^{0.4}$). All curves are smoothed for clarity.
  • Figure 4: Analysis of entropy, KL divergence, gradient norm, and validation score over training steps. (a–b) GMPO maintains higher entropy than GRPO, whether trained on MATH Level 3–Level 5 or the DeepScaleR dataset. (c–d) GMPO maintains a more stable gradient norm and a smaller KL divergence from the pre-RL model than GRPO. (e–h) GMPO outperforms GRPO in validation scores across language-only and multimodal tasks, for both dense and Mixture-of-Experts models.
  • Figure 5: Analysis of entropy, KL divergence, gradient norm, and validation score over training steps on Mixture-of-Experts models. (a) GMPO maintains a smaller KL divergence than GRPO. (b) GMPO maintains higher entropy than GRPO. (c–d) GMPO maintains a more stable gradient norm than GRPO, suggesting more stable policy optimization. (e–f) GMPO achieves a higher validation score than GRPO.
  • Figure 6: Sequence-level importance sampling ratios from trajectories that yield positive rewards during GRPO training. Without normalization, these ratios can become highly unstable, especially as the response length increases.
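
The length effect described in the Figure 6 caption can be sketched numerically. The example below assumes that a sequence-level importance sampling ratio is the product of per-token ratios and that normalization means taking the 1/T power (a geometric mean over T tokens); both the function names and the constant per-token log-ratio are hypothetical choices for illustration, not values or code from the paper.

```python
import math

def sequence_ratio(token_log_ratios):
    """Unnormalized sequence-level ratio: product of per-token ratios."""
    return math.exp(sum(token_log_ratios))

def length_normalized_ratio(token_log_ratios):
    """Sequence-level ratio normalized by length (geometric mean of token ratios)."""
    return math.exp(sum(token_log_ratios) / len(token_log_ratios))

# Assumed small, constant drift in log-ratio per token (hypothetical value).
per_token_log_ratio = 0.05

for length in (10, 100, 1000):
    logs = [per_token_log_ratio] * length
    print(
        f"T={length:5d}  "
        f"unnormalized={sequence_ratio(logs):.3e}  "
        f"normalized={length_normalized_ratio(logs):.4f}"
    )
```

Under these assumptions the unnormalized ratio grows exponentially with response length, while the length-normalized ratio stays at the per-token level regardless of length.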