Behaviour Distillation

Andrei Lupu, Chris Lu, Jarek Liesen, Robert Tjarko Lange, Jakob Foerster

TL;DR

The paper formalizes behaviour distillation, a setting that aims to discover and then condense the information required for training an expert policy into a synthetic dataset of state-action pairs, without access to expert data. Visualizing the resulting synthetic datasets can also provide human-interpretable task insights.

Abstract

Dataset distillation aims to condense large datasets into a small number of synthetic examples that can be used as drop-in replacements when training new models. It has applications to interpretability, neural architecture search, privacy, and continual learning. Despite strong successes in supervised domains, such methods have not yet been extended to reinforcement learning, where the lack of a fixed dataset renders most distillation methods unusable. Filling the gap, we formalize behaviour distillation, a setting that aims to discover and then condense the information required for training an expert policy into a synthetic dataset of state-action pairs, without access to expert data. We then introduce Hallucinating Datasets with Evolution Strategies (HaDES), a method for behaviour distillation that can discover datasets of just four state-action pairs which, under supervised learning, train agents to competitive performance levels in continuous control tasks. We show that these datasets generalize out of distribution to training policies with a wide range of architectures and hyperparameters. We also demonstrate application to a downstream task, namely training multi-task agents in a zero-shot fashion. Beyond behaviour distillation, HaDES provides significant improvements in neuroevolution for RL over previous approaches and achieves SoTA results on one standard supervised dataset distillation task. Finally, we show that visualizing the synthetic datasets can provide human-interpretable task insights.
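The bi-level structure described in the abstract (an outer evolutionary search over a synthetic dataset, an inner supervised-learning step that trains a policy on it, and a fitness given by that policy's return) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and not taken from the paper: the toy 1-D task, the linear policy, the hyperparameters, and the simple elitist (1+λ) evolution strategy that stands in for whichever ES variant HaDES actually uses.

```python
import random


def rollout_return(w, b, steps=20):
    """Toy 1-D task (hypothetical): drive the state s towards 0.

    The policy is linear, a = w * s + b, with actions clipped to [-1, 1].
    The return is the negative sum of squared states over two start states.
    """
    total = 0.0
    for start in (-1.0, 1.0):
        s = start
        for _ in range(steps):
            a = max(-1.0, min(1.0, w * s + b))
            s += 0.5 * a
            total -= s * s
    return total


def behaviour_clone(dataset, init, lr=0.1, epochs=50):
    """Inner loop: fit a = w * s + b to (state, action) pairs with
    gradient descent on the mean squared error, starting from `init`."""
    w, b = init
    n = len(dataset)
    for _ in range(epochs):
        gw = sum(2.0 * (w * s + b - a) * s for s, a in dataset) / n
        gb = sum(2.0 * (w * s + b - a) for s, a in dataset) / n
        w, b = w - lr * gw, b - lr * gb
    return w, b


def fitness(flat, init):
    """Fitness of a synthetic dataset: the return of the policy trained on it.

    `flat` stores the dataset as [s_1, a_1, s_2, a_2, ...].
    """
    dataset = list(zip(flat[0::2], flat[1::2]))
    return rollout_return(*behaviour_clone(dataset, init))


def evolve_dataset(n_pairs=2, generations=40, children=16, sigma=0.3, seed=0):
    """Outer loop: elitist (1+lambda) ES over the dataset entries (an assumption)."""
    rng = random.Random(seed)
    init = (0.0, 0.0)                 # single fixed policy initialization
    parent = [0.0] * (2 * n_pairs)    # uninformative starting dataset
    parent_fit = fitness(parent, init)
    history = [parent_fit]
    for _ in range(generations):
        offspring = [[x + sigma * rng.gauss(0.0, 1.0) for x in parent]
                     for _ in range(children)]
        scored = [(fitness(child, init), child) for child in offspring]
        best_fit, best = max(scored, key=lambda fc: fc[0])
        if best_fit > parent_fit:     # keep the parent unless a child is better
            parent, parent_fit = best, best_fit
        history.append(parent_fit)
    return parent, history


if __name__ == "__main__":
    dataset, history = evolve_dataset()
    print("evolved (state, action) pairs:", list(zip(dataset[0::2], dataset[1::2])))
    print("fitness before / after:", history[0], history[-1])
```

Note that the search never sees expert demonstrations: the only learning signal is the return of the policy trained on the candidate dataset. The sketch uses one fixed initialization, which corresponds to what the Figure 2 caption calls HaDES-F. Averaging the fitness over several freshly sampled initializations per generation would correspond to the HaDES-R variant.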

Paper Structure (26 sections, 5 equations, 14 figures, 4 tables, 1 algorithm)

Figures (14)

  • Figure 1: Entire synthetic datasets required to train an optimal Cartpole policy (top) and an expert Hopper policy with behaviour cloning (bottom). The state-action pairs help interpret the learned policies. Red box contains observation features for Cartpole and action labels (torques) for Hopper.
  • Figure 2: Left: Standard neuroevolution. Middle: HaDES-F. Right: HaDES-R. HaDES-F uses a single fixed policy initialization. HaDES-R samples $k\geq2$ policy initializations every generation.
  • Figure 3: HaDES trains competitive policies on (a) Brax, using 64 state-action pairs, and (b) MinAtar, using 16 state-action pairs. For each environment, we show the mean return of the population at each generation for HaDES-F, HaDES-R and direct neuroevolution through ES, as well as the final PPO performance after $5\times10^7$ steps. HaDES-F matches or outperforms direct ES on all Brax environments, outperforms ES in one out of four MinAtar environments and matches it in two others. We also observe a significant gap between HaDES-F and HaDES-R, as predicted in \ref{sect:hades}.
  • Figure 4: Hopper dataset transfer to other architectures and training parameters. We take a synthetic dataset of 64 state-action pairs evolved for policy networks of size 512 (highlighted) and use it to train policies with varying widths and 50 hyperparameter combinations per width. We plot the top 50% within each width group. HaDES-F indicates that the dataset was trained with a fixed $\pi_0$. The HaDES-R dataset was trained with randomized $(\pi_0^1, \dots, \pi_0^k)_i$ and generalizes much better across all architectures and training parameters. This holds generally across environments (see \ref{sect:appendix_gen}).
  • Figure 5: We use the synthetic datasets to train multi-task agents without any additional environment interaction. We plot the normalized fitness of agents trained either on the correct dataset for their environment, the wrong dataset for their environment, or a combined dataset, merged through concatenation and zero-padding so that the observation sizes match. Left: we train multi-task agents that achieve $\sim 50\%$ normalized fitness for Halfcheetah and Hopper. Right: we train agents that achieve $\gtrsim 100\%$ normalized fitness for Humanoid and Humanoidstandup. The multi-task policy architecture and training parameters were not optimized. We plot the mean $\pm$ standard error across 10 seeds. This shows that synthetic datasets can accelerate future research on RL foundation models.
  • ...and 9 more figures
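
The Figure 5 caption describes merging per-environment datasets "through concatenation and zero-padding" so that observation sizes match. A minimal sketch of one way to do this is given below. The function names and the list-of-vectors data layout are hypothetical, and padding the action vectors as well is an assumption: the caption only mentions observations.

```python
def pad_rows(rows, width):
    """Right-pad every vector in `rows` with zeros up to length `width`."""
    return [list(row) + [0.0] * (width - len(row)) for row in rows]


def merge_datasets(datasets):
    """Merge several (observations, actions) datasets into one.

    `datasets` is a list of (observations, actions) tuples, where both
    entries are lists of equal-length vectors within one environment.
    """
    obs_dim = max(len(o) for obs, _ in datasets for o in obs)
    act_dim = max(len(a) for _, acts in datasets for a in acts)
    merged_obs, merged_act = [], []
    for obs, acts in datasets:
        merged_obs += pad_rows(obs, obs_dim)   # zero-padding stated in the caption
        merged_act += pad_rows(acts, act_dim)  # assumption: actions padded the same way
    return merged_obs, merged_act


if __name__ == "__main__":
    env_a = ([[1.0, 2.0, 3.0]], [[0.5]])   # 3-D observation, 1-D action
    env_b = ([[4.0]], [[0.1, 0.2]])        # 1-D observation, 2-D action
    print(merge_datasets([env_a, env_b]))
```

A single policy with input size `obs_dim` can then be trained on the merged pairs with ordinary supervised learning, without further environment interaction.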