Learning Without Critics? Revisiting GRPO in Classical Reinforcement Learning Environments

Bryan L. M. de Oliveira, Felipe V. Frujeri, Marcos P. C. M. Queiroz, Luana G. B. Martins, Telma W. de L. Soares, Luckeciano C. Melo

TL;DR

This work systematically evaluates Group Relative Policy Optimization (GRPO), a critic-free policy gradient method, in classical single-task RL benchmarks. By ablating baselines, discounting, and group sampling, it shows that learned critics are essential for long-horizon tasks, while critic-free GRPO is competitive only in short-horizon environments like CartPole. GRPO generally benefits from high discount factors ($\gamma \approx 0.99$), with HalfCheetah as a notable exception where $\gamma = 0.9$ is preferable because its episodes never terminate early. Additionally, smaller group sizes yield better stability and efficiency, revealing limitations of batch-based grouping, which mixes unrelated episodes. These results clarify the conditions under which critic-free methods are viable and identify grouping strategies as a crucial area for future improvement.

Abstract

Group Relative Policy Optimization (GRPO) has emerged as a scalable alternative to Proximal Policy Optimization (PPO) by eliminating the learned critic and instead estimating advantages through group-relative comparisons of trajectories. This simplification raises fundamental questions about the necessity of learned baselines in policy-gradient methods. We present the first systematic study of GRPO in classical single-task reinforcement learning environments, spanning discrete and continuous control tasks. Through controlled ablations isolating baselines, discounting, and group sampling, we reveal three key findings: (1) learned critics remain essential for long-horizon tasks: all critic-free baselines underperform PPO except in short-horizon environments like CartPole where episodic returns can be effective; (2) GRPO benefits from high discount factors ($\gamma = 0.99$) except in HalfCheetah, where lack of early termination favors moderate discounting ($\gamma = 0.9$); (3) smaller group sizes outperform larger ones, suggesting limitations in batch-based grouping strategies that mix unrelated episodes. These results reveal both the limitations of critic-free methods in classical control and the specific conditions where they remain viable alternatives to learned value functions.
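
To make "group-relative comparisons of trajectories" concrete, the following is a minimal sketch, not the paper's implementation. It assumes the commonly described GRPO form in which each episode's discounted return is normalized by the mean and standard deviation of its group; the function names `discounted_return` and `group_relative_advantages`, the `eps` term, and the toy reward sequences are illustrative assumptions.

```python
import numpy as np


def discounted_return(rewards, gamma):
    """Discounted return of one episode: sum over t of gamma**t * r_t."""
    g = 0.0
    # Accumulate backwards so each step applies one factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def group_relative_advantages(episode_rewards, gamma=0.99, eps=1e-8):
    """One scalar advantage per episode, relative to its group (assumed form).

    episode_rewards: list of reward sequences, one per episode in the group.
    No learned critic is used; the group mean acts as the baseline.
    """
    returns = np.array([discounted_return(r, gamma) for r in episode_rewards])
    return (returns - returns.mean()) / (returns.std() + eps)


# Toy group of four CartPole-like episodes (reward 1 per step, varying length).
group = [[1.0] * 10, [1.0] * 20, [1.0] * 30, [1.0] * 40]
adv = group_relative_advantages(group, gamma=1.0)
print(adv)
```

In this sketch longer episodes receive positive advantages and shorter ones negative, and the advantages within a group sum to approximately zero. The group size here is simply the number of episodes passed in, which is the quantity the group-size ablation varies.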

Paper Structure

This paper contains 15 sections, 3 equations, 10 figures.

Figures (10)

  • Figure 1: Performance comparison of PPO and GRPO with commonly used settings. This motivating example shows PPO ($\gamma=0.99$, $N_{\text{steps}}=128$) and GRPO ($\gamma=1$, $N_{\text{steps}}=H$) using our grouping strategy. Relative performance varies and depends on the environment characteristics.
  • Figure 2: Baseline ablations across environments. We compare PPO with its learned critic against PPO variants without a baseline, with simple alternatives (batch mean, Gaussian, EMA), and GRPO with group-relative normalization. Removing the baseline substantially increases variance, especially in long-horizon continuous control, while GRPO performs comparably to simple baselines.
  • Figure 3: Effect of discount factor across environments. GRPO performance with varying $\gamma$ values. Higher $\gamma$ generally improves performance, with notable exceptions in HalfCheetah (optimal around $\gamma = 0.9$ to $0.95$) and MountainCarContinuous (best at $\gamma = 0.99$).
  • Figure 4: Effect of group size on GRPO across environments. Each subfigure shows episodic returns when varying the number of parallel environments used as a group ($G \in \{8, 16, 32, 64\}$).
  • Figure 5: PPO vs. GRPO. This motivating example shows PPO with standard settings ($\gamma=0.99$, trajectory length 128) and GRPO with standard settings using our grouping strategy ($\gamma=1$, full episodes). Relative performance varies and depends on the environment characteristics. This figure shows all environments.
  • ...and 5 more figures
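
The simple critic-free baselines named in the Figure 2 caption (no baseline, batch mean, EMA) can be sketched as follows. This is an illustrative assumption about their form, not the paper's code: the `EMABaseline` class, the `advantages` helper, and the decay value 0.9 are hypothetical, and the Gaussian baseline is omitted because the caption does not define it.

```python
import numpy as np


class EMABaseline:
    """Exponential moving average of batch-mean returns (hypothetical helper)."""

    def __init__(self, decay=0.9):
        self.decay = decay  # assumed value, not taken from the paper
        self.value = None

    def update(self, returns):
        batch_mean = float(np.mean(returns))
        if self.value is None:
            # First batch: initialize the average with the batch mean.
            self.value = batch_mean
        else:
            self.value = self.decay * self.value + (1 - self.decay) * batch_mean
        return self.value


def advantages(returns, baseline="none", ema=None):
    """Subtract a critic-free baseline from per-episode returns."""
    returns = np.asarray(returns, dtype=float)
    if baseline == "none":
        return returns  # raw returns, highest variance
    if baseline == "batch_mean":
        return returns - returns.mean()
    if baseline == "ema":
        # The EMA is updated with the current batch before it is subtracted.
        return returns - ema.update(returns)
    raise ValueError(f"unknown baseline: {baseline}")


ema = EMABaseline(decay=0.9)
first = advantages([1.0, 2.0, 3.0], baseline="ema", ema=ema)
second = advantages([4.0, 5.0, 6.0], baseline="ema", ema=ema)
print(first, second)
```

Unlike the batch mean, the EMA baseline carries information across updates, so the second batch is compared against a value that still mostly reflects the first.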