Proximal Policy Gradient Arborescence for Quality Diversity Reinforcement Learning

Sumeet Batra; Bryon Tjanaka; Matthew C. Fontaine; Aleksei Petrenko; Stefanos Nikolaidis; Gaurav Sukhatme

TL;DR

The paper addresses the challenge of discovering diverse, high-performing policies in high-dimensional robotic tasks by combining on-policy reinforcement learning with Differentiable Quality Diversity (DQD). It introduces Proximal Policy Gradient Arborescence (PPGA), which uses Vectorized PPO (VPPO) together with Markovian Measure Proxies (MMPs) to estimate the objective and measure gradients, and a Natural Evolution Strategy (xNES) to adapt the gradient coefficient distribution toward maximal archive improvement. PPGA also introduces a mechanism for walking the search policy toward unexplored regions of the archive. On the humanoid domain, it achieves a 4× improvement in best reward over state-of-the-art QD-RL baselines while preserving diversity. The results point to a useful synergy between on-policy RL and DQD, and the authors make experiments and resources available to the community.

Abstract

Training generally capable agents that thoroughly explore their environment and learn new and diverse skills is a long-term goal of robot learning. Quality Diversity Reinforcement Learning (QD-RL) is an emerging research area that blends the best aspects of both fields: Quality Diversity (QD) provides a principled form of exploration and produces collections of behaviorally diverse agents, while Reinforcement Learning (RL) provides a powerful performance improvement operator enabling generalization across tasks and dynamic environments. Existing QD-RL approaches have been constrained to sample-efficient, deterministic off-policy RL algorithms and/or evolution strategies, and struggle with highly stochastic environments. In this work, we adapt on-policy RL, specifically Proximal Policy Optimization (PPO), to the Differentiable Quality Diversity (DQD) framework for the first time, and propose additional improvements over prior work that enable efficient optimization and discovery of novel skills on challenging locomotion tasks. Our new algorithm, Proximal Policy Gradient Arborescence (PPGA), achieves state-of-the-art results, including a 4x improvement in best reward over baselines on the challenging humanoid domain.

Paper Structure (24 sections, 1 equation, 11 figures, 3 tables, 4 algorithms)

Introduction
Background
Deep Reinforcement Learning
Quality Diversity Optimization
Differentiable Quality Diversity
Quality Diversity Reinforcement Learning
Proposed Method: The Proximal Policy Gradient Arborescence Algorithm
Markovian Measure Proxies
Policy Gradients for Differentiable Quality Diversity Optimization
Connection to Natural Evolution Strategies
Walking the Search Policy
Experiments
Comparisons
Post-Hoc Archive Analysis
Discussion and Limitations
...and 9 more sections

Figures (11)

Figure 1: PPGA finds a diverse archive of high-performing locomotion behaviors for a humanoid agent by combining PPO gradient approximations with Differentiable Quality Diversity algorithms. The archive's dimensions correspond to the measures $m_1$ and $m_2$, i.e., the proportion of time that the left and right feet contact the ground. The color of each cell shows the objective value, i.e., how fast the humanoid moves. For instance, jumping moves the humanoid forward quickly, with the left and right feet individually contacting the ground 30% and 22% of the time, respectively.
Figure 2: PPGA estimates $\nabla f, \nabla \textbf{m}$ with PPO. We randomly sample gradient coefficients $\textbf{c}$ and perform weighted linear recombination of the objective-measure gradients with $\textbf{c}$ as the weights. This produces a population of gradients that, in turn, result in a population of branched policies. The policies are evaluated and inserted into the archive. xNES adapts the gradient coefficient distribution based on these insertions towards maximal archive improvement. The new mean of the coefficient distribution is used to walk the search policy towards a new, potentially unexplored region of the archive.
Figure 3: 2D Archive visualizations of PPGA compared to the current state-of-the-art QD-RL algorithm PGA-ME. We use 50x50 archives to show detail.
Figure 4: QD metrics and cumulative distributions for archives from PPGA compared to baselines. The CCDF plots in the last row indicate the percentage of archive policies that fall above a certain objective threshold. All plots are averaged over four seeds, and the shaded region represents a 95% bootstrapped confidence interval.
Figure 5: PPGA vs TD3GA on Humanoid on the standard QD metrics. All plots are averaged over 4 seeds. The shaded regions are the 95% bootstrapped confidence intervals.
...and 6 more figures
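
The branching step described in the Figure 2 caption (sampling gradient coefficients c and linearly recombining the objective and measure gradients) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `branch_policies`, the flat list-of-floats parameter vector, and the isotropic Gaussian coefficient sampler are all assumptions made for illustration, whereas in PPGA the gradients come from PPO and the coefficient distribution is adapted by xNES.

```python
import random


def branch_policies(theta, grad_f, grad_m, mean, sigma, num_branches, rng):
    """Hypothetical helper: branch a search policy into several candidates.

    Each branch is theta + c[0] * grad_f + sum_j c[j] * grad_m[j - 1],
    where c is drawn from an isotropic Gaussian (an assumption; xNES
    would maintain and adapt a full distribution over coefficients).
    """
    grads = [grad_f] + list(grad_m)  # objective gradient first, then measures
    branches, coeffs = [], []
    for _ in range(num_branches):
        # Sample one coefficient per gradient.
        c = [rng.gauss(mu, sigma) for mu in mean]
        # Weighted linear recombination of the gradients, per parameter.
        step = [
            sum(c_j * g[i] for c_j, g in zip(c, grads))
            for i in range(len(theta))
        ]
        branches.append([t + s for t, s in zip(theta, step)])
        coeffs.append(c)
    return branches, coeffs


# Toy example: three parameters, one objective gradient, two measure gradients.
rng = random.Random(0)
theta = [0.0, 0.0, 0.0]
grad_f = [1.0, 0.0, 0.0]
grad_m = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
branches, coeffs = branch_policies(
    theta, grad_f, grad_m, mean=[1.0, 0.0, 0.0], sigma=0.5,
    num_branches=4, rng=rng,
)
```

Because the toy gradients are orthogonal unit vectors and theta is zero, each branch equals its sampled coefficient vector, which makes the recombination easy to check by hand. In the full method, the branched policies would then be evaluated and inserted into the archive, and the resulting archive improvement would drive the coefficient distribution update.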
