Group Policy Gradient

Junhua Chen, Zixi Zhang, Hantao Zhong, Rika Antonova

TL;DR

Group Policy Gradient (GPG) is a critic-free policy-gradient framework for general MDPs that replaces the learned value function with a group-based Monte Carlo advantage estimator while preserving PPO's clipped objective. The authors prove that the GPG estimator is consistent in the large-group limit and analyze its bias-variance tradeoffs. On Gymnasium tasks, GPG matches or surpasses PPO, particularly when many parallel environments are available. By generalizing GRPO, GPG removes the memory and computation cost of a critic and makes efficient use of parallel simulations. The work also gives guidance on choosing the group size and binning function, pointing toward scalable, resource-efficient RL methods for general tasks.

Abstract

We introduce Group Policy Gradient (GPG), a family of critic-free policy-gradient estimators for general MDPs. Inspired by the success of GRPO's approach in Reinforcement Learning from Human Feedback (RLHF), GPG replaces a learned value function with a group-based Monte Carlo advantage estimator, removing the memory, compute, and hyperparameter costs of training a critic while preserving PPO's clipped-objective structure. We prove the consistency of the GPG estimator, analyze the bias-variance tradeoffs, and demonstrate empirically that GPG matches or outperforms PPO on standard benchmarks. GPG makes better use of parallel simulations, which, together with its critic-free design, results in more efficient use of computational resources than PPO.
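
The core idea of a group-based advantage combined with a clipped objective can be illustrated with a minimal NumPy sketch. It assumes the simplest GRPO-style baseline (each rollout's return minus the mean return of its group) and does not reproduce the paper's exact estimator or its binning scheme; the function names `group_advantages` and `clipped_surrogate` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def group_advantages(returns, normalize=True, eps=1e-8):
    """Critic-free advantage (assumed GRPO-style): return minus the group mean.

    `returns` holds one Monte Carlo return per rollout in the group.
    With normalize=True the result is also divided by the group std.
    """
    returns = np.asarray(returns, dtype=np.float64)
    adv = returns - returns.mean()
    if normalize:
        adv = adv / (returns.std() + eps)
    return adv

def clipped_surrogate(ratio, adv, clip_eps=0.2):
    """PPO clipped surrogate objective (to be maximized), averaged over samples.

    `ratio` is pi_new(a|s) / pi_old(a|s) for each sample.
    """
    ratio = np.asarray(ratio, dtype=np.float64)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return np.minimum(unclipped, clipped).mean()

# Toy group of four rollouts collected in parallel (made-up numbers).
returns = [1.0, 2.0, 3.0, 6.0]
adv = group_advantages(returns, normalize=False)   # [-2, -1, 0, 3]
ratio = np.array([1.0, 1.5, 0.5, 1.5])             # hypothetical probability ratios
objective = clipped_surrogate(ratio, adv)
```

No value network appears anywhere in this sketch: the baseline comes entirely from the other rollouts in the group, which is why a larger number of parallel environments gives a better baseline estimate.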

Paper Structure

This paper contains 33 sections, 21 equations, 5 figures, 1 table, 2 algorithms.

Figures (5)

  • Figure 1: PPO (top) estimates the advantage function using Generalized Advantage Estimation (GAE) with the aid of a learnt value function. In contrast, GPG (bottom) utilizes group-averaged rewards to reduce policy gradient variance. GPG avoids learning a value function and makes greater use of the information in parallel simulations, thereby making better use of computational resources.
  • Figure 2: Average episodic reward of GPG and PPO for different numbers of parallel environments. For clarity, we plot on a logarithmic scale for the CliffWalker environment. Given a large number of parallel environments, GPG dominates on all tasks.
  • Figure 3: Average episodic reward of GPG for varying numbers of parallel environments, plotted against the number of environment steps on a logarithmic scale for clarity. Increasing the number of parallel environments generally reduces sample efficiency, requiring more total environment interactions (but fewer iterations) to reach a given performance threshold, yet it yields higher performance per iteration.
  • Figure 4: Average episodic reward of GPG on LunarLander with different binning functions.
  • Figure : GPG Update