Learning Without Critics? Revisiting GRPO in Classical Reinforcement Learning Environments
Bryan L. M. de Oliveira, Felipe V. Frujeri, Marcos P. C. M. Queiroz, Luana G. B. Martins, Telma W. de L. Soares, Luckeciano C. Melo
TL;DR
This work systematically evaluates Group Relative Policy Optimization (GRPO), a critic-free policy gradient method, on classical single-task RL benchmarks. By ablating baselines, discounting, and group sampling, it shows that learned critics are essential for long-horizon tasks, while critic-free GRPO performs competitively only in short-horizon environments like CartPole. GRPO generally benefits from high discount factors ($\gamma = 0.99$), with HalfCheetah as a notable exception where the lack of early termination favors $\gamma = 0.9$. Additionally, smaller group sizes outperform larger ones, pointing to limitations of batch-based grouping when it mixes unrelated episodes. These results clarify the conditions under which critic-free methods are viable and identify grouping strategies as an area for future improvement.
Abstract
Group Relative Policy Optimization (GRPO) has emerged as a scalable alternative to Proximal Policy Optimization (PPO) by eliminating the learned critic and instead estimating advantages through group-relative comparisons of trajectories. This simplification raises fundamental questions about the necessity of learned baselines in policy-gradient methods. We present the first systematic study of GRPO in classical single-task reinforcement learning environments, spanning discrete and continuous control tasks. Through controlled ablations isolating baselines, discounting, and group sampling, we reveal three key findings: (1) learned critics remain essential for long-horizon tasks: all critic-free baselines underperform PPO except in short-horizon environments like CartPole where episodic returns can be effective; (2) GRPO benefits from high discount factors ($\gamma = 0.99$) except in HalfCheetah, where lack of early termination favors moderate discounting ($\gamma = 0.9$); (3) smaller group sizes outperform larger ones, suggesting limitations in batch-based grouping strategies that mix unrelated episodes. These results reveal both the limitations of critic-free methods in classical control and the specific conditions where they remain viable alternatives to learned value functions.
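To make the phrase "group-relative comparisons of trajectories" concrete, the following is a minimal sketch of one common way to compute critic-free advantages: each episode's discounted return is standardized against the mean and standard deviation of its group. The function names, the `eps` constant, and the toy reward sequences are illustrative assumptions, and the exact estimator used in the paper's ablations may differ.

```python
from statistics import mean, pstdev


def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for a single episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def group_relative_advantages(episodes, gamma=0.99, eps=1e-8):
    """One advantage per episode: its return standardized within the group.

    Assumption: the group baseline is the mean return and the scale is the
    population standard deviation; `eps` (hypothetical) avoids division by
    zero when all returns in the group are equal.
    """
    returns = [discounted_return(ep, gamma) for ep in episodes]
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(g - mu) / (sigma + eps) for g in returns]


# Toy group of three CartPole-like episodes (reward 1 per surviving step).
group = [[1.0] * 5, [1.0] * 20, [1.0] * 50]
advantages = group_relative_advantages(group, gamma=0.99)
```

In this sketch no value function is learned: the group mean plays the role of the baseline, so the quality of the advantage estimate depends on how comparable the episodes in a group are, which is the concern the abstract raises about batch-based grouping.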
