AutoQD: Automatic Discovery of Diverse Behaviors with Quality-Diversity Optimization

Saeed Hedayatian, Stefanos Nikolaidis

TL;DR

Through experiments on multiple continuous control tasks, AutoQD is shown to discover diverse policies without predefined behavioral descriptors, presenting a well-motivated alternative to prior methods in unsupervised reinforcement learning and QD optimization.

Abstract

Quality-Diversity (QD) algorithms have shown remarkable success in discovering diverse, high-performing solutions, but rely heavily on hand-crafted behavioral descriptors that constrain exploration to predefined notions of diversity. Leveraging the equivalence between policies and occupancy measures, we present a theoretically grounded approach to automatically generate behavioral descriptors by embedding the occupancy measures of policies in Markov Decision Processes. Our method, AutoQD, leverages random Fourier features to approximate the Maximum Mean Discrepancy (MMD) between policy occupancy measures, creating embeddings whose distances reflect meaningful behavioral differences. A low-dimensional projection of these embeddings that captures the most behaviorally significant dimensions can then be used as behavioral descriptors for CMA-MAE, a state-of-the-art black-box QD method, to discover diverse policies. We prove that distances between our embeddings converge to the true MMD distances between occupancy measures as the number of sampled trajectories and the embedding dimension increase. Through experiments on multiple continuous control tasks, we demonstrate AutoQD's ability to discover diverse policies without predefined behavioral descriptors, presenting a well-motivated alternative to prior methods in unsupervised reinforcement learning and QD optimization. Our approach opens new possibilities for open-ended learning and automated behavior discovery in sequential decision-making settings without requiring domain-specific knowledge. Source code is available at https://github.com/conflictednerd/autoqd-code.
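The embedding step described in the abstract can be illustrated with a small NumPy sketch. It is not taken from the AutoQD repository: the Gaussian kernel, the bandwidth `sigma`, the helper names (`make_rff`, `embed`, `gaussian_mmd`), and the synthetic Gaussian samples standing in for state-action pairs are all assumptions made for illustration. The sketch averages random Fourier features over samples and compares the resulting Euclidean distance with a direct kernel estimate of the MMD.

```python
import numpy as np

def make_rff(d, D, sigma=1.0, seed=0):
    """Random Fourier feature map for a Gaussian kernel (assumed kernel choice)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(D, d))  # frequencies ~ N(0, sigma^-2 I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)       # random phases in [0, 2*pi)

    def phi(x):
        # x has shape (n, d); the result has shape (n, D)
        return np.sqrt(2.0 / D) * np.cos(x @ W.T + b)

    return phi

def embed(samples, phi):
    """Embedding = mean feature vector over the sampled state-action pairs."""
    return phi(samples).mean(axis=0)

def gaussian_mmd(x, y, sigma=1.0):
    """Biased (V-statistic) MMD estimate using the exact Gaussian kernel."""
    def k(a, b):
        sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq / (2.0 * sigma ** 2))

    val = k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()
    return np.sqrt(max(val, 0.0))

# Synthetic stand-ins for state-action samples from two policies (hypothetical data).
rng = np.random.default_rng(1)
d, n, D = 4, 500, 2000
x = rng.normal(0.0, 1.0, size=(n, d))
y = rng.normal(0.5, 1.0, size=(n, d))

phi = make_rff(d, D)
rff_dist = np.linalg.norm(embed(x, phi) - embed(y, phi))
exact = gaussian_mmd(x, y)
print(f"RFF embedding distance: {rff_dist:.4f}, kernel MMD estimate: {exact:.4f}")
```

The two printed numbers should be close, and the gap should shrink as `D` grows. The practical appeal of this design is that each policy is summarized by a single `D`-dimensional vector, so pairwise comparisons need no kernel matrix.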

Paper Structure

This paper contains 47 sections, 4 theorems, 49 equations, 15 figures, 9 tables, 3 algorithms.

Key Result

Theorem 1

Let $\pi_1, \pi_2$ be any two policies with occupancy measures $\rho_1, \rho_2$, and let $\phi_1, \phi_2$ be their embeddings, estimated by taking the mean of the $D$-dimensional random Fourier features of $n$ i.i.d. samples from each occupancy measure. Then the Euclidean distance $\|\phi_1 - \phi_2\|$ approximates $\mathrm{MMD}(\rho_1, \rho_2)$, with an error that vanishes as $n$ and $D$ increase; the bound involves $d$, the dimension of state-action vectors, and a constant $c>0$. A proof is provided in the appendix.
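For reference, the quantities in the theorem can be written out using the standard definitions of MMD and random Fourier features. The Gaussian kernel with bandwidth $\sigma$ and the sample notation $x^{(i)}_m$ are assumptions made here for concreteness; the paper's exact bound and constants are not restated.

```latex
% Squared MMD between occupancy measures for a kernel k
\mathrm{MMD}_k^2(\rho_1, \rho_2)
  = \mathbb{E}_{x, x' \sim \rho_1}\,[k(x, x')]
  + \mathbb{E}_{y, y' \sim \rho_2}\,[k(y, y')]
  - 2\,\mathbb{E}_{x \sim \rho_1,\, y \sim \rho_2}\,[k(x, y)]

% Random Fourier features for the Gaussian kernel (assumed kernel)
\phi(x) = \sqrt{\tfrac{2}{D}}\,\bigl[\cos(w_j^\top x + b_j)\bigr]_{j=1}^{D},
\qquad w_j \sim \mathcal{N}(0, \sigma^{-2} I_d),
\quad b_j \sim \mathrm{Unif}[0, 2\pi]

\mathbb{E}_{w, b}\bigl[\phi(x)^\top \phi(y)\bigr]
  = \exp\!\Bigl(-\tfrac{\|x - y\|^2}{2\sigma^2}\Bigr) = k(x, y)

% Embedding of policy i from n samples, and the distance that is compared to the MMD
\phi_i = \frac{1}{n} \sum_{m=1}^{n} \phi\bigl(x^{(i)}_m\bigr),
\qquad
\|\phi_1 - \phi_2\|^2 \approx \mathrm{MMD}_k^2(\rho_1, \rho_2)
```

Two sources of error are visible here: the finite number of features $D$ (kernel approximation) and the finite number of samples $n$ (expectation approximation), which matches the two quantities the theorem lets grow.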

Figures (15)

  • Figure 1: Overview of AutoQD. Left: Policy parameters are sampled from a CMA-ES instance and evaluated in the environment. The collected trajectories are embedded via a random Fourier features map $\phi$ to produce the policy embedding $\psi^\pi$, which is then projected to a low-dimensional descriptor using the affine map $\mathbf{A}\psi^\pi + \mathbf{b}$. The policy is added to the archive based on its return $J(\pi)$ and descriptors $\mathrm{desc}(\pi)$, and CMA-ES updates its distribution based on the improvement made to the archive. Right: Periodically, embeddings from the archive are used to update $\mathbf{A}$ and $\mathbf{b}$ via cwPCA.
  • Figure 2: Overview of the proposed policy embedding. Each policy $\pi_i$ induces an occupancy measure $\rho^{\pi_i}$ over state-action pairs. From sampled trajectories, a feature map $\phi$ embeds the policies into a vector space. Theorem 1 guarantees that the Euclidean distance between embeddings approximates the Maximum Mean Discrepancy (MMD) between the corresponding occupancy measures.
  • Figure 3: Performance of the best policy found by each algorithm under changing friction (left) or mass scale (right). The shaded regions represent the standard error across $32$ evaluation seeds.
  • Figure 4: Number of successfully adapting policies in each population under changing friction. A policy is considered successful if its mean return is at least $Rp$, where $R$ is the highest overall return achieved in the unaltered environment. Results are shown for two success thresholds: $p=0.9$ (left) and $p=0.7$ (right).
  • Figure 5: Quality-diversity trade-off of algorithms across domains. The x-axis shows normalized mean fitness (quality) and the y-axis shows normalized Vendi score (diversity). Each point corresponds to the outcome of an algorithm in one of the six domains. Points above the diagonal line exhibit a more favorable trade-off of quality for diversity.
  • ...and 10 more figures
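The affine descriptor map $\mathbf{A}\psi^\pi + \mathbf{b}$ from Figure 1 can be sketched as follows. The details of cwPCA are not given in this summary, so plain PCA is swapped in as a stand-in; the function names (`fit_affine_projection`, `describe`), the descriptor dimension `k=2`, and the random archive embeddings are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_affine_projection(embeddings, k=2):
    """Fit A (k x D) and b (k,) with plain PCA, a stand-in for cwPCA."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    A = Vt[:k]
    b = -A @ mean  # so that A @ psi + b == A @ (psi - mean)
    return A, b

def describe(psi, A, b):
    """Low-dimensional descriptor of one policy embedding psi."""
    return A @ psi + b

# Hypothetical archive of 50 policy embeddings of dimension 16,
# with variance concentrated in the first few coordinates.
rng = np.random.default_rng(0)
scales = np.linspace(3.0, 0.1, 16)
archive_embeddings = rng.normal(size=(50, 16)) * scales

A, b = fit_affine_projection(archive_embeddings, k=2)
descs = np.array([describe(psi, A, b) for psi in archive_embeddings])
print(descs.shape)
```

In a loop like the one in Figure 1, such a fit would be refreshed periodically from the embeddings currently stored in the archive, and each new policy would be placed in the archive according to `describe(psi, A, b)`.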

Theorems & Definitions (7)

  • Theorem 1: MMD Approximation
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof