SAPG: Split and Aggregate Policy Gradients

Jayesh Singla, Ananye Agarwal, Deepak Pathak

TL;DR

SAPG tackles the saturation of on-policy RL, like PPO, in large-scale parallel environments by dividing environments into blocks and training multiple follower policies whose data are aggregated via importance sampling to update a common leader. The approach blends on-policy PPO updates with off-policy data from other policies, while promoting diversity through latent conditioning and entropy regularization. Empirical results on hard dexterous manipulation tasks show SAPG achieving superior asymptotic performance compared to strong baselines, demonstrating the value of data diversity and off-policy aggregation in scalable RL. This method enables efficient utilization of massive GPU-based simulators for robust, high-performance policy learning in complex robotic control tasks.

Abstract

Despite extreme sample inefficiency, on-policy reinforcement learning, aka policy gradients, has become a fundamental tool in decision-making problems. With the recent advances in GPU-driven simulation, the ability to collect large amounts of data for RL training has scaled exponentially. However, we show that current RL methods, e.g. PPO, fail to ingest the benefit of parallelized environments beyond a certain point and their performance saturates. To address this, we propose a new on-policy RL algorithm that can effectively leverage large-scale environments by splitting them into chunks and fusing them back together via importance sampling. Our algorithm, termed SAPG, shows significantly higher performance across a variety of challenging environments where vanilla PPO and other strong baselines fail to achieve high performance. Website at https://sapg-rl.github.io/
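The split-and-aggregate idea in the abstract can be illustrated with a small sketch. It assumes equally sized contiguous blocks of environments and a PPO-style clipped importance ratio for reweighting data collected by another policy; the function names, the clipping constant, and the exact form of the surrogate are illustrative assumptions, not the authors' implementation.

```python
import math


def split_into_blocks(num_envs, num_policies):
    """Assign environment indices to equally sized contiguous blocks, one per policy."""
    if num_envs % num_policies != 0:
        raise ValueError("num_envs must be divisible by num_policies")
    size = num_envs // num_policies
    return [list(range(i * size, (i + 1) * size)) for i in range(num_policies)]


def clipped_is_surrogate(logp_target, logp_behavior, advantages, clip_eps=0.2):
    """Mean clipped surrogate using the ratio pi_target(a|s) / pi_behavior(a|s).

    logp_target:   log-probabilities of the sampled actions under the policy being updated
    logp_behavior: log-probabilities under the policy that collected the data
    clip_eps:      hypothetical clipping range (assumption, PPO-style)
    """
    total = 0.0
    for lt, lb, adv in zip(logp_target, logp_behavior, advantages):
        ratio = math.exp(lt - lb)  # importance weight
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# 12 environments shared among 3 policies (1 leader, 2 followers).
blocks = split_into_blocks(num_envs=12, num_policies=3)

# Leader's own data: behavior and target policy coincide, so every ratio is 1.
on_policy = clipped_is_surrogate([-1.0, -0.5], [-1.0, -0.5], [1.0, -2.0])

# Follower data reused by the leader: the ratio exp(1) is clipped to 1 + clip_eps.
off_policy = clipped_is_surrogate([-1.0], [-2.0], [1.0])
```

When the data come from the policy being updated, the ratio is exactly 1 and the surrogate reduces to the mean advantage; for data from another policy, the clipping bounds how much any single off-policy sample can contribute.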

Paper Structure (34 sections, 9 equations, 8 figures, 4 tables, 1 algorithm)

Figures (8)

  • Figure 1: We introduce a new class of on-policy RL algorithms that can scale to tens of thousands of parallel environments. In contrast to regular on-policy RL, such as PPO, which learns a single policy across environments leading to wasted environment capacity, our method learns diverse followers and combines data from them to learn a more optimal leader in a continuous online manner.
  • Figure 2: Performance vs. batch size for PPO runs (blue curve) across two environments. The curve shows that PPO training runs cannot benefit from the large batch sizes produced by massively parallelized environments: their asymptotic performance saturates beyond a certain point. The dashed red line is the performance of our method, SAPG, with more details in the results section. It serves as evidence that higher performance is achievable with larger batch sizes.
  • Figure 3: We illustrate one particular variant of SAPG which performs well. There is one leader and $M-1$ followers ($M=3$ in the figure). Each policy has the same backbone with shared parameters $B_\theta$ but is conditioned on local learned parameters $\phi_i$. Each policy gets a block of $\frac{N}{M}$ environments to run. The leader is updated with its on-policy data as well as importance-sampled off-policy data from the followers. Each follower uses only its own data for on-policy updates.
  • Figure 4: Two data aggregation schemes we consider in this paper. (Left) One policy is a leader and uses data from each of the followers. (Right) A symmetric scheme where each policy uses data from all others. In each case, the policy also uses its own on-policy data.
  • Figure 5: Performance curves of SAPG compared with PPO, PBT and PQL baselines. On AllegroKuka tasks, PPO and PQL barely make progress and SAPG beats PBT. On ShadowHand, AllegroKuka Reorientation, and Two Arms Reorientation, SAPG performs best with an entropy coefficient of 0.005, while the coefficient is 0 for the other environments. On ShadowHand and AllegroHand, PQL is initially more sample efficient, but SAPG is more performant in the longer run. AllegroKuka environments use successes as the performance metric, while AllegroHand and ShadowHand use episode rewards.
  • ...and 3 more figures
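
The two aggregation schemes described in the Figure 4 caption can be written down as a mapping from each policy to the policies whose data it trains on. This is a minimal sketch; the function name, the scheme labels, and the convention that index 0 is the leader are assumptions made for illustration.

```python
def aggregation_sources(num_policies, scheme="leader"):
    """Map each policy index to the policy indices whose data it trains on.

    scheme="leader":    policy 0 (assumed to be the leader) uses its own data plus
                        data from every follower; each follower uses only its own.
    scheme="symmetric": every policy uses its own data plus data from all others.
    """
    everyone = list(range(num_policies))
    if scheme == "leader":
        sources = {0: everyone}
        for i in range(1, num_policies):
            sources[i] = [i]
        return sources
    if scheme == "symmetric":
        return {i: list(everyone) for i in range(num_policies)}
    raise ValueError(f"unknown scheme: {scheme!r}")


leader_scheme = aggregation_sources(3, scheme="leader")
symmetric_scheme = aggregation_sources(3, scheme="symmetric")
```

In both schemes a policy's own index appears in its source list, matching the caption's note that each policy also uses its own on-policy data; data from any other index would be treated as off-policy and importance-weighted.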