Group Policy Gradient
Junhua Chen, Zixi Zhang, Hantao Zhong, Rika Antonova
TL;DR
Group Policy Gradient (GPG) is a critic-free policy-gradient framework for general MDPs. It replaces the learned value function with a group-based Monte Carlo advantage estimator while keeping PPO's clipped objective. The authors prove that the GPG estimator is consistent in the large-group limit and analyze its bias-variance tradeoffs. Empirically, GPG matches or surpasses PPO on Gymnasium tasks, particularly when many parallel environments are available. GPG generalizes GRPO to general MDPs: dropping the critic removes its memory and compute costs, and the group-based estimator makes efficient use of parallel simulations. The analysis also informs design choices such as group size and binning, which matters for scalable, resource-efficient RL on general tasks.
Abstract
We introduce Group Policy Gradient (GPG), a family of critic-free policy-gradient estimators for general MDPs. Inspired by the success of GRPO's approach in Reinforcement Learning from Human Feedback (RLHF), GPG replaces a learned value function with a group-based Monte Carlo advantage estimator, removing the memory, compute, and hyperparameter costs of training a critic while preserving PPO's clipped-objective structure. We prove the consistency of the GPG estimator, analyze the bias-variance tradeoffs, and demonstrate empirically that GPG matches or outperforms PPO on standard benchmarks. GPG makes better use of parallel simulations, which, together with its critic-free design, results in more efficient use of computational resources than PPO.
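To make the idea of a group-based advantage concrete, the following is a minimal NumPy sketch, not the paper's estimator. It assumes a group is a set of Monte Carlo returns collected from parallel environments, and it assumes GRPO-style normalization (subtract the group mean, divide by the group standard deviation). The estimator defined in the paper may form, bin, or normalize groups differently. The function names `group_advantages` and `clipped_surrogate` and the sample values are hypothetical.

```python
import numpy as np


def group_advantages(returns, eps=1e-8):
    """Group baseline (assumed GRPO-style): center by the group mean and
    scale by the group standard deviation, so no learned critic is needed."""
    returns = np.asarray(returns, dtype=np.float64)
    baseline = returns.mean()      # Monte Carlo baseline from the group itself
    scale = returns.std() + eps    # eps avoids division by zero for equal returns
    return (returns - baseline) / scale


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate (to be maximized), averaged over samples."""
    logp_new = np.asarray(logp_new, dtype=np.float64)
    logp_old = np.asarray(logp_old, dtype=np.float64)
    advantages = np.asarray(advantages, dtype=np.float64)
    ratio = np.exp(logp_new - logp_old)  # importance ratio pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()


# Illustrative values only: returns from four hypothetical parallel environments.
returns = [1.0, 2.0, 3.0, 4.0]
adv = group_advantages(returns)

# With an unchanged policy the ratio is 1, so the objective is the mean advantage.
logp = np.zeros(4)
objective = clipped_surrogate(logp, logp, adv)
```

Because the baseline is computed from the group itself, the advantages sum to zero within each group. Under this assumed form, a larger group gives a less noisy baseline, which is one way to see why more parallel environments could help.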
