Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning

Naoki Shitanda; Motoki Omura; Tatsuya Harada; Takayuki Osa

TL;DR

This work theoretically analyzes the impact of inter-policy diversity on learning efficiency in policy ensembles and proposes Coupled Policy Optimization, which regulates diversity through KL constraints between policies. The results indicate that diverse exploration under appropriate regulation is key to achieving stable and sample-efficient learning in ensemble policy gradient methods.

Abstract

Scaling reinforcement learning to tens of thousands of parallel environments requires overcoming the limited exploration capacity of a single policy. Ensemble-based policy gradient methods, which employ multiple policies to collect diverse samples, have recently been proposed to promote exploration. However, merely broadening the exploration space does not always enhance learning capability, since excessive exploration can reduce exploration quality or compromise training stability. In this work, we theoretically analyze the impact of inter-policy diversity on learning efficiency in policy ensembles, and propose Coupled Policy Optimization (CPO), which regulates diversity through KL constraints between policies. The proposed method enables effective exploration and outperforms strong baselines such as SAPG, PBT, and PPO across multiple tasks, including challenging dexterous manipulation, in terms of both sample efficiency and final performance. Furthermore, analysis of policy diversity and effective sample size during training reveals that follower policies naturally distribute around the leader, demonstrating the emergence of structured and efficient exploratory behavior. Our results indicate that diverse exploration under appropriate regulation is key to achieving stable and sample-efficient learning in ensemble policy gradient methods. Project page: https://naoki04.github.io/paper-cpo/
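
The link the abstract draws between policy misalignment and sample efficiency can be illustrated with a small numerical sketch. It assumes diagonal Gaussian policies and the standard Kish effective sample size, (sum w)^2 / sum w^2, over importance weights; the paper's own definitions may differ. All function names and parameters below are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_log_prob(actions, mean, std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    var = std ** 2
    per_dim = -0.5 * ((actions - mean) ** 2 / var + np.log(2 * np.pi * var))
    return np.sum(per_dim, axis=-1)

def effective_sample_size(log_w):
    """Kish ESS = (sum w)^2 / sum w^2, computed from log importance weights."""
    w = np.exp(log_w - np.max(log_w))  # rescaling the weights leaves the ratio unchanged
    return w.sum() ** 2 / np.sum(w ** 2)

def leader_ess_fraction(follower_mean, leader_mean, std=1.0, n=4096, dim=4, seed=0):
    """Fraction of follower samples that remain 'effective' for the leader.

    Assumption: both policies are Gaussians with a shared std and a constant
    mean in every action dimension (a toy stand-in for real policies).
    """
    rng = np.random.default_rng(seed)
    f_mean = np.full(dim, follower_mean)
    l_mean = np.full(dim, leader_mean)
    actions = rng.normal(f_mean, std, size=(n, dim))  # actions sampled by the follower
    log_w = gaussian_log_prob(actions, l_mean, std) - gaussian_log_prob(actions, f_mean, std)
    return effective_sample_size(log_w) / n

# A follower close to the leader keeps most of its samples useful,
# while a distant follower contributes far fewer effective samples.
near = leader_ess_fraction(0.2, 0.0)
far = leader_ess_fraction(2.0, 0.0)
```

In this toy setting the fraction shrinks as the follower drifts away from the leader, which is one way to read the abstract's claim that unregulated diversity can hurt learning efficiency.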

Paper Structure (38 sections, 19 equations, 9 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Distributed Reinforcement Learning
Agent Ensemble in Paralleled Environments
Policy Update with Regularization
Preliminaries
Reinforcement Learning
Proximal Policy Optimization (PPO)
Split and Aggregate Policy Gradients (SAPG)
Effect of Policy Diversity on Ensemble Policy Gradient
Coupled Policy Optimization
Follower's Policy Update under KL Constraint
Adversarial Reward for Followers Distribution
Experiments
Results and Analysis
...and 23 more sections

Figures (9)

Figure 1: Appropriately controlled policy diversity improves the learning efficiency of ensemble RL in large-scale environments. (a) The leader-follower approach is an agent ensemble method that aggregates samples from multiple followers into a leader policy. (b) Misalignment between policies may cause a decline in sample efficiency and training stability. (c) Our method introduces KL divergence constraints to keep followers distributed around the leader, as well as an adversarial reward to prevent policy overconcentration.
Figure 2: Comparison of algorithm performance across ten robotic tasks. Learning curves across six dexterous manipulation, two gripper-based manipulation and two locomotion tasks comparing CPO to SAPG, PBT, and PPO. CPO consistently achieves higher sample efficiency and final performance, particularly in ShadowHand, AllegroHand, AllegroKukaReorientation, Two-Arms Reorientation, FrankaCubePush and Stack.
Figure 3: Training curves from the ablation study with different values of $\lambda_f$.
Figure 4: Comparison of how the KL divergence between agents evolves during training under different algorithms. Each heatmap shows the KL divergence between the leader and follower policies during training. Row $i$, column $j$ indicates the forward KL from agent $i$ to agent $j$. The white circle marks the agent closest to each follower, excluding itself. SAPG often shows misaligned followers, while our method keeps them well-distributed around the leader.
Figure 5: Effects of KL constraint and adversarial reward on performance. Learning curves on ShadowHand and AllegroHand tasks for three variants: full CPO (red), CPO without adversarial reward (blue), and CPO without KL constraint (green).
...and 4 more figures
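
The pairwise heatmap described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, assuming each agent's policy is a diagonal Gaussian evaluated at a single state, so the forward KL has a closed form; how the paper aggregates KL over states is not stated here. Agent index 0 plays the role of the leader, and all names and values are hypothetical.

```python
import numpy as np

def kl_diag_gaussian(mean_p, std_p, mean_q, std_q):
    """Closed-form KL(p || q) between two diagonal Gaussians."""
    var_p, var_q = std_p ** 2, std_q ** 2
    per_dim = np.log(std_q / std_p) + (var_p + (mean_p - mean_q) ** 2) / (2 * var_q) - 0.5
    return np.sum(per_dim)

def pairwise_kl(means, stds):
    """Matrix whose entry [i, j] is the forward KL from agent i to agent j."""
    n = len(means)
    kl = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            kl[i, j] = kl_diag_gaussian(means[i], stds[i], means[j], stds[j])
    return kl

def nearest_agent(kl):
    """For each row, index of the closest other agent (the diagonal is excluded)."""
    masked = kl.copy()
    np.fill_diagonal(masked, np.inf)
    return masked.argmin(axis=1)

# Toy ensemble: agent 0 is the leader, agents 1 and 2 stay near it,
# and agent 3 has drifted away (illustrative values only).
means = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 2.0]])
stds = np.ones_like(means)
kl = pairwise_kl(means, stds)
nearest = nearest_agent(kl)
```

Here agents 1 and 2 have the leader as their nearest neighbour, while agent 3 is closest to another follower, which mirrors the kind of misalignment the caption attributes to SAPG.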
