Geometric-Mean Policy Optimization

Yuzhong Zhao; Yue Liu; Junpeng Liu; Jingye Chen; Xun Wu; Yaru Hao; Tengchao Lv; Shaohan Huang; Lei Cui; Qixiang Ye; Fang Wan; Furu Wei

TL;DR

GMPO introduces a geometric-mean objective to stabilize policy optimization in reinforcement learning for large language models, addressing the instability that outlier token rewards cause in GRPO. By maximizing the geometric mean of token-level rewards and applying token-level clipping with a broader range, GMPO achieves more stable importance weights, higher entropy, and more robust updates. The authors provide theoretical justification and empirical validation on five mathematical reasoning benchmarks and a multimodal geometry task, showing consistent improvements over GRPO. The method is a plug-and-play change to post-training RL for LLMs.

Abstract

Group Relative Policy Optimization (GRPO) has significantly enhanced the reasoning capability of large language models by optimizing the arithmetic mean of token-level rewards. Unfortunately, GRPO is observed to suffer from unstable policy updates when facing tokens with outlier importance-weighted rewards, which manifest as extreme importance sampling ratios during training. In this study, we propose Geometric-Mean Policy Optimization (GMPO), which aims to improve the stability of GRPO by suppressing token reward outliers. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and maintains a more stable range of importance sampling ratios. GMPO is plug-and-play: it simply replaces GRPO's arithmetic mean with the geometric mean of token-level rewards. GMPO is also theoretically plausible: analysis reveals that both GMPO and GRPO are weighted forms of the policy gradient, while the former enjoys more stable weights, which consequently benefits policy optimization and performance. Experiments on multiple mathematical reasoning benchmarks show that GMPO-7B improves the average Pass@1 of GRPO by up to 4.1%, outperforming many state-of-the-art approaches. Code is available at https://github.com/callsys/GMPO.
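
The outlier-sensitivity claim can be illustrated with a toy calculation. The sketch below is not the GMPO objective itself (it omits advantages and clipping, and is not taken from the authors' code); it only compares how the arithmetic and geometric means of a sequence of positive importance sampling ratios react when one token has an extreme ratio. The function names and the example ratios are hypothetical, chosen for illustration.

```python
import math


def arithmetic_mean(values):
    """Plain average: one large value shifts the result linearly."""
    return sum(values) / len(values)


def geometric_mean(values):
    """Geometric mean computed in log space; assumes all values are positive."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-token importance sampling ratios for one response.
ratios = [1.0, 1.0, 1.0, 1.0]
# Same response, but one token has an outlier ratio (illustrative value).
ratios_with_outlier = [1.0, 1.0, 1.0, 16.0]

# How far each aggregate moves when the outlier appears.
am_shift = arithmetic_mean(ratios_with_outlier) - arithmetic_mean(ratios)
gm_shift = geometric_mean(ratios_with_outlier) - geometric_mean(ratios)

print(f"arithmetic mean shift: {am_shift:.2f}")
print(f"geometric mean shift:  {gm_shift:.2f}")
```

In this toy case the single outlier moves the arithmetic mean much further than the geometric mean, because the geometric mean averages the ratios in log space, where one extreme value contributes only logarithmically.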
