GEPO: Group Expectation Policy Optimization for Stable Heterogeneous Reinforcement Learning

Han Zhang, Ruibin Zheng, Zexuan Yi, Zhuo Zhang, Hanyang Peng, Hui Wang, Zike Yuan, Cai Ke, Shiwei Chen, Jiacheng Yang, Yangning Li, Xiang Li, Jiangyue Yan, Yaoqi Liu, Liwen Jing, Jiayin Qi, Ruifeng Xu, Binxing Fang, Yue Yu

TL;DR

This work addresses the challenge of decentralized, latency-prone RL for large language models by decoupling rollout sampling from parameter updates in a HeteroRL framework. It introduces GEPO, a group-level policy optimization method that replaces token- or sequence-level importance weights with group-averaged expectations, yielding exponential variance reduction under high policy divergence. The authors provide theoretical guarantees and extensive experiments on Qwen3 models with up to 1800 seconds of delay, demonstrating superior stability and performance relative to GRPO, GSPO, and other baselines, especially in heterogeneous networks. Practical contributions include a star-topology multi-node setup, a localized reward computation optimization to reduce communication, and comprehensive hyperparameter analyses guiding deployment in real-world WAN environments.

Abstract

As single-center computing approaches power constraints, decentralized training becomes essential. However, traditional Reinforcement Learning (RL) methods, crucial for enhancing large model post-training, cannot adapt to decentralized distributed training due to the tight coupling between parameter learning and rollout sampling. To address this, we propose HeteroRL, a heterogeneous RL architecture that decouples these processes, enabling stable training across geographically distributed nodes connected via the Internet. The core component is Group Expectation Policy Optimization (GEPO), an asynchronous RL algorithm robust to latency caused by network delays or heterogeneity in computational resources. Our study reveals that high latency significantly increases KL divergence, leading to higher variance of importance weights and training instability. GEPO mitigates this issue by using group expectation weighting to exponentially reduce the variance of importance weights, with theoretical guarantees. Experiments show GEPO achieves superior stability, with only a 3% performance drop from online training to 1800 s latency, and reduces the best-to-last gap by 85% versus GSPO (1.8 vs. 12.0) while attaining the highest scores, highlighting its effectiveness in decentralized, resource-heterogeneous environments.
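
The link between KL divergence and importance-weight variance can be made concrete with a standard identity. The sketch below is a textbook argument, not the paper's own derivation; it assumes discrete distributions, with $p$ standing for the learner policy and $q$ for the stale sampler policy, and writes $\chi^2(p \| q)$ for the chi-square divergence.

$$\mathrm{Var}_{y \sim q}\left[\frac{p(y|x)}{q(y|x)}\right] = \sum_y \frac{p(y|x)^2}{q(y|x)} - 1 = \chi^2(p \| q)$$

$$D_{\mathrm{KL}}(p \| q) = \mathbb{E}_{y \sim p}\left[\log \frac{p(y|x)}{q(y|x)}\right] \le \log \mathbb{E}_{y \sim p}\left[\frac{p(y|x)}{q(y|x)}\right] = \log\left(1 + \chi^2(p \| q)\right)$$

The inequality is Jensen's. Rearranging gives $\mathrm{Var}_{y \sim q}\left[\frac{p(y|x)}{q(y|x)}\right] \ge \exp\left(D_{\mathrm{KL}}(p \| q)\right) - 1$, so the variance of the standard importance weight grows at least exponentially in the KL divergence.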

Paper Structure

This paper contains 71 sections, 5 theorems, 67 equations, 14 figures, 15 tables.

Key Result

Theorem 1

Let $p, q$ be discrete probability distributions. Then there exists a constant $C$ such that, when $D_{\mathrm{KL}}(p \| q) > \log C$, it holds that $\mathrm{Var}\left[\frac{p(y|x)}{q(y|x)}\right] > \mathrm{Var}\left[ \frac{p(y|x)}{\widehat{\mathbb{E}}_q[q(y|x)]}\right]$.
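
The variance comparison in Theorem 1 can be checked numerically on a toy example. The sketch below is illustrative only and is not the paper's implementation. It assumes the group estimate $\widehat{\mathbb{E}}_q[q(y|x)]$ is replaced by its exact expectation $\sum_y q(y|x)^2$ (the large-group limit), and the three-outcome distributions `q`, `p_far`, and `p_near` are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def var_standard_weight(p, q):
    """Exact Var_q[p(y)/q(y)] = sum_y p(y)^2 / q(y) - 1."""
    return sum(pi * pi / qi for pi, qi in zip(p, q)) - 1.0


def var_group_weight(p, q):
    """Exact Var_q[p(y) / E_q[q(y)]] with E_q[q] = sum_y q(y)^2.

    Assumption: the group estimate of E_q[q] is replaced by its exact value.
    """
    e_q = sum(qi * qi for qi in q)
    mean_p = sum(qi * pi for pi, qi in zip(p, q))
    mean_p2 = sum(qi * pi * pi for pi, qi in zip(p, q))
    return (mean_p2 - mean_p ** 2) / e_q ** 2


q = [0.90, 0.05, 0.05]      # hypothetical stale sampler policy
p_far = [0.05, 0.05, 0.90]  # hypothetical learner policy, high KL from q
p_near = list(q)            # learner identical to sampler, zero KL

for name, p in [("high KL", p_far), ("zero KL", p_near)]:
    print(name,
          "KL =", round(kl_divergence(p, q), 4),
          "Var standard =", round(var_standard_weight(p, q), 4),
          "Var group =", round(var_group_weight(p, q), 4))
```

In the high-KL case the group-expectation weight has far lower variance than the standard weight, while in the zero-KL case the standard weight is exactly constant and wins. This matches the theorem's condition that the advantage holds only once the KL divergence exceeds a threshold.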

Figures (14)

  • Figure 1: Left: GEPO improves upon GRPO and GSPO by employing group-level importance weights to enhance training stability. Right: In both zero-delay (online) and high-delay (up to 1800 seconds) heterogeneous reinforcement learning scenarios, GEPO demonstrates superior stability and better evaluation performance.
  • Figure 2: In high-KL regions, $\mathrm{Var}\left[ \frac{p(y|x)}{\widehat{\mathbb{E}}_q[q(y|x)]}\right] \ll \mathrm{Var}\left[ \frac{p(y|x)}{q(y|x)}\right]$.
  • Figure 3: Overview of HeteroRL. By decoupling sampling and training, HeteroRL enables decentralized distributed RL training of LLMs across five compute nodes: one parameter update node (learner) and four data generation nodes (samplers), forming a star-shaped network topology. Network delays between the sampler and learner nodes are explicitly modeled and can be simulated using stochastic distributions such as the log-normal or Weibull distribution.
  • Figure 4: Curves of importance sampling variance, training gradient norm, and train/eval reward under max delay 64. Compared to GRPO and GSPO, GEPO maintains more stable importance weight variance, resulting in less drastic gradient changes, more stable training, and no decline in training reward.
  • Figure 5: KL divergence, importance-weight (IW) variance, and estimation error are all positively correlated with the number of delay steps.
  • ...and 9 more figures
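
The delay simulation mentioned in the Figure 3 caption might look like the following minimal sketch. It is not the authors' code: the distribution parameters, the `sample_delay` helper, and the choice to clip at the maximum delay are all assumptions made for illustration. Only the four-sampler star topology and the 1800-second upper bound come from the text above.

```python
import random

MAX_DELAY_S = 1800.0  # largest delay mentioned in the summary above
NUM_SAMPLERS = 4      # sampler nodes in the star topology of Figure 3


def sample_delay(rng, dist="lognormal"):
    """Draw one sampler-to-learner delay in seconds (hypothetical helper)."""
    if dist == "lognormal":
        # mu and sigma of the underlying normal: assumed values
        delay = rng.lognormvariate(5.0, 1.0)
    elif dist == "weibull":
        # scale and shape: assumed values
        delay = rng.weibullvariate(600.0, 1.5)
    else:
        raise ValueError(f"unknown distribution: {dist}")
    # Assumption: delays beyond the maximum are clipped rather than resampled.
    return min(delay, MAX_DELAY_S)


rng = random.Random(0)  # seeded for reproducibility
lognormal_delays = [sample_delay(rng, "lognormal") for _ in range(NUM_SAMPLERS)]
weibull_delays = [sample_delay(rng, "weibull") for _ in range(NUM_SAMPLERS)]

for i, (a, b) in enumerate(zip(lognormal_delays, weibull_delays)):
    print(f"sampler_{i}: lognormal={a:.1f}s weibull={b:.1f}s")
```

Each sampler would then hold its rollouts for the drawn delay before delivering them to the learner, so the learner trains on data generated by an older policy.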

Theorems & Definitions (7)

  • Theorem 1
  • Lemma 1: Range of Quantities
  • proof
  • proof
  • Corollary 1
  • Theorem 2: Bias of GEPO
  • Theorem 3: Variance of GEPO