Enabling Adaptive Agent Training in Open-Ended Simulators by Targeting Diversity

Robby Costales, Stefanos Nikolaidis

TL;DR

Empirical results showcase DIVA's unique ability to overcome complex parameterizations and successfully train adaptive agent behavior, far outperforming competitive baselines from prior literature. These results highlight the potential of semi-supervised environment design (SSED) approaches to enable training in realistic simulated domains and to produce more robust and capable adaptive agents.

Abstract

The wider application of end-to-end learning methods to embodied decision-making domains remains bottlenecked by their reliance on a superabundance of training data representative of the target domain. Meta-reinforcement learning (meta-RL) approaches abandon the aim of zero-shot generalization--the goal of standard reinforcement learning (RL)--in favor of few-shot adaptation, and thus hold promise for bridging larger generalization gaps. While learning this meta-level adaptive behavior still requires substantial data, efficient environment simulators approaching real-world complexity are growing in prevalence. Even so, hand-designing sufficiently diverse and numerous simulated training tasks for these complex domains is prohibitively labor-intensive. Domain randomization (DR) and procedural generation (PG), offered as solutions to this problem, require simulators to possess carefully-defined parameters which directly translate to meaningful task diversity--a similarly prohibitive assumption. In this work, we present DIVA, an evolutionary approach for generating diverse training tasks in such complex, open-ended simulators. Like unsupervised environment design (UED) methods, DIVA can be applied to arbitrary parameterizations, but can additionally incorporate realistically-available domain knowledge--thus inheriting the flexibility and generality of UED, and the supervised structure embedded in well-designed simulators exploited by DR and PG. Our empirical results showcase DIVA's unique ability to overcome complex parameterizations and successfully train adaptive agent behavior, far outperforming competitive baselines from prior literature. These findings highlight the potential of such semi-supervised environment design (SSED) approaches, of which DIVA is the first humble constituent, to enable training in realistic simulated domains, and produce more robust and capable adaptive agents.

Paper Structure

This paper contains 61 sections, 19 figures, 6 tables, 3 algorithms.

Figures (19)

  • Figure 1: Highly structured environment simulators assume access to parameterizations $E_\textnormal{S}(\bm{\theta})$ for which random seeds $\bm{\theta}_i$ directly produce meaningfully diverse features (e.g. racing tracks with challenging turns). Open-ended environments with flexible, unstructured parameterizations $E_\textnormal{U}(\bm{\theta})$, though enabling more complex emergent features, lack direct control over high-level features of interest. We introduce DIVA, an approach that effectively creates a more workable parameterization $E_\textnormal{QD}(\bm{\theta})$ by evolving levels beyond the minimally diverse population from $E_\textnormal{U}(\bm{\theta})$. By training on these discovered levels, DIVA enables superior performance on downstream tasks.
  • Figure 2: DIVA archive updates on Alchemy. The first stage (a) begins with bounds that encompass both the initial solutions and the target region. As the first stage progresses (b) and QD discovers more of the solution space, the sampling region for the emitters gradually shrinks towards the target region. The second stage begins by redefining the archive bounds to be the target region and including some extra feature dimensions (c). QD now fills out just the target region (d), using sample weights from the target-derived prior (e), the same distribution used to sample levels during meta-training.
  • Figure 3: Left: A GridNav agent attempting to locate the goal across two episodic rollouts. Right: The marginal probability of sampled goals inhabiting each $y$ for different complexities $k$ of $E_k(\bm{\theta})$.
  • Figure 4: GridNav analysis and results. (a) Target region coverage produced by DIVA and DR over different genotype complexities $k$. DR represents the average coverage of batches corresponding to the size of the QD archive. DR$^*$ represents the total number of unique levels discovered over a number of parameter randomization steps equal to the number of additional environments PLR$^\perp$ is provided for evaluation. DR$^*$ is thus an upper bound on the diversity that PLR$^\perp$ can capture. 500k iterations (QD or otherwise) are used across all results. (b) The diversity produced by PLR$^\perp$ and ACCEL over the course of training (later updates omitted due to no change in trend). (c) Final episode return curves for DIVA and baselines. (d) Final method success rates across each episode.
  • Figure 5: Alchemy environment and results. (a) A visual representation of Alchemy's structured stone latent space. $P_1$ and $P_2$ represent potions acting on stones. Only $P_1$ results in a latent state change, because $P_2$ would push the stone outside of the valid latent lattice. (b) Marginal feature distributions for $E_\textnormal{S}$ (the structured target distribution), DIVA, and $E_\textnormal{U}$ (the unstructured distribution used directly for DR, and to initialize DIVA's archive). (c) Final episode return curves for DIVA and baselines. (d) Number of unique genotypes used by each method over the course of meta-training.
  • ...and 14 more figures
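
The captions for Figures 2 and 4(a) refer to a sampling region that shrinks towards a target region and to "target region coverage". The sketch below is one possible reading of those two ideas, not the authors' implementation. The function names, the linear shrinking schedule, and the uniform grid discretization are all assumptions made for illustration; the captions only state that the region "gradually shrinks" and do not specify how coverage is computed.

```python
from typing import List, Optional, Sequence, Tuple

# One (low, high) pair per feature dimension.
Bounds = List[Tuple[float, float]]


def shrink_bounds(initial: Bounds, target: Bounds, progress: float) -> Bounds:
    """Move each bound from `initial` toward `target`.

    ASSUMPTION: a linear schedule in `progress` (0 = start, 1 = end of
    the first stage). The source does not specify the schedule.
    """
    t = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    return [
        (lo_i + t * (lo_t - lo_i), hi_i + t * (hi_t - hi_i))
        for (lo_i, hi_i), (lo_t, hi_t) in zip(initial, target)
    ]


def cell_index(
    features: Sequence[float], bounds: Bounds, cells_per_dim: int
) -> Optional[Tuple[int, ...]]:
    """Return the grid cell containing `features`, or None if outside `bounds`."""
    index = []
    for x, (lo, hi) in zip(features, bounds):
        if not lo <= x <= hi:
            return None
        # Points on the upper edge are assigned to the last cell.
        i = min(int((x - lo) / (hi - lo) * cells_per_dim), cells_per_dim - 1)
        index.append(i)
    return tuple(index)


def target_coverage(
    levels: Sequence[Sequence[float]], target: Bounds, cells_per_dim: int
) -> float:
    """Fraction of target-region grid cells occupied by at least one level.

    ASSUMPTION: coverage is measured on a uniform grid over the target region.
    """
    occupied = {cell_index(f, target, cells_per_dim) for f in levels}
    occupied.discard(None)  # levels outside the target region do not count
    return len(occupied) / cells_per_dim ** len(target)


if __name__ == "__main__":
    initial = [(0.0, 10.0), (0.0, 10.0)]
    target = [(4.0, 6.0), (4.0, 6.0)]
    print(shrink_bounds(initial, target, 0.5))
    levels = [(4.5, 4.5), (5.5, 5.5), (4.6, 4.4), (9.0, 9.0)]
    print(target_coverage(levels, target, cells_per_dim=2))
```

Under these assumptions, each level is represented only by its feature vector. Two levels that fall in the same cell count once, and levels outside the target region are ignored, which matches the intuition that coverage measures diversity within the target region rather than the raw number of levels.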