Adversarial Imitation Learning via Boosting

Jonathan D. Chang, Dhruv Sreenivas, Yingbing Huang, Kianté Brantley, Wen Sun

TL;DR

The paper introduces AILBoost, a principled off-policy adversarial imitation learning algorithm that uses gradient boosting in the state-action occupancy space to build a weighted ensemble of weak policies. AILBoost maintains a weighted replay buffer and updates a discriminator via the variational form of the reverse KL divergence to guide the addition of new weak learners, which are trained with an off-policy RL oracle (SAC). Empirically, AILBoost outperforms DAC, ValueDICE, IQ-Learn, and BC across multiple DeepMind Control Suite tasks, including vision-based settings, and demonstrates robust performance with limited expert data. The approach offers a scalable, data-efficient alternative to prior off-policy IL methods and opens possibilities for extending boosting to discrete control and learning from observations.
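
To make the "variational form of the reverse KL divergence" concrete, here is a minimal sketch on small discrete distributions. It assumes the reverse KL is KL(policy occupancy || expert occupancy) and uses the standard f-divergence representation KL(P || Q) = sup_T E_P[T] - E_Q[exp(T - 1)]. The function names and toy distributions are illustrative assumptions, not taken from the paper.

```python
import math


def kl(p, q):
    """Exact KL(P || Q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def variational_bound(p, q, t):
    """Lower bound E_P[T] - E_Q[exp(T - 1)] for a witness function T.

    In an AIL setting, T would play the role of the discriminator;
    here it is just a list of values, one per outcome (an assumption).
    """
    expect_p = sum(pi * ti for pi, ti in zip(p, t))
    expect_q = sum(qi * math.exp(ti - 1.0) for qi, ti in zip(q, t))
    return expect_p - expect_q


def optimal_witness(p, q):
    """Maximizer of the bound: T*(x) = 1 + log(p(x) / q(x))."""
    return [1.0 + math.log(pi / qi) for pi, qi in zip(p, q)]


# Toy occupancies (made up for illustration only).
policy_occupancy = [0.7, 0.2, 0.1]
expert_occupancy = [0.3, 0.3, 0.4]

exact = kl(policy_occupancy, expert_occupancy)
tight = variational_bound(
    policy_occupancy,
    expert_occupancy,
    optimal_witness(policy_occupancy, expert_occupancy),
)
loose = variational_bound(policy_occupancy, expert_occupancy, [0.0, 0.0, 0.0])
print(exact, tight, loose)
```

The bound is tight only at the optimal witness; any other witness gives a smaller value, which is why maximizing over the discriminator yields an estimate of the divergence.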

Abstract

Adversarial imitation learning (AIL) has stood out as a dominant framework across various imitation learning (IL) applications, with Discriminator Actor Critic (DAC) (Kostrikov et al., 2019) demonstrating the effectiveness of off-policy learning algorithms in improving sample efficiency and scalability to higher-dimensional observations. Despite DAC's empirical success, the original AIL objective is on-policy, and DAC's ad-hoc application of off-policy training does not guarantee successful imitation (Kostrikov et al., 2019; 2020). Follow-up work such as ValueDICE (Kostrikov et al., 2020) tackles this issue by deriving a fully off-policy AIL objective. In this work, we instead develop a novel and principled AIL algorithm via the framework of boosting. Like boosting, our new algorithm, AILBoost, maintains an ensemble of properly weighted weak learners (i.e., policies) and trains a discriminator that witnesses the maximum discrepancy between the distributions of the ensemble and the expert policy. We maintain a weighted replay buffer to represent the state-action distribution induced by the ensemble, allowing us to train discriminators using all of the data collected so far. In the weighted replay buffer, the contribution of the data from older policies is properly discounted, with the weight computed based on the boosting framework. Empirically, we evaluate our algorithm on both controller state-based and pixel-based environments from the DeepMind Control Suite. AILBoost outperforms DAC on both types of environments, demonstrating the benefit of properly weighting replay buffer data for off-policy training. On state-based environments, AILBoost outperforms ValueDICE and IQ-Learn (Garg et al., 2021), achieving competitive performance with as little as one expert trajectory.
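
The weighted replay buffer described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a fixed step size `alpha` and a geometric mixture update (each new weak learner receives weight `alpha`, and all older weights are scaled by `1 - alpha`), which is one common boosting-style choice. The class name, method names, and the transition format are hypothetical.

```python
import random


class WeightedReplayBuffer:
    """Stores one data segment per weak learner with a mixture weight.

    Assumption: weights follow d_{t+1} = (1 - alpha) * d_t + alpha * d_new,
    so data from older policies is discounted geometrically.
    """

    def __init__(self, alpha):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.segments = []  # list of transition lists, one per weak learner
        self.weights = []   # mixture weight of each segment, sums to 1

    def add_policy_data(self, transitions):
        """Register the data collected by a newly added weak learner."""
        transitions = list(transitions)
        if not transitions:
            raise ValueError("a segment needs at least one transition")
        if not self.segments:
            self.weights = [1.0]  # the first learner carries all the mass
        else:
            discounted = [w * (1.0 - self.alpha) for w in self.weights]
            self.weights = discounted + [self.alpha]
        self.segments.append(transitions)

    def sample(self, batch_size, rng=random):
        """Draw transitions: pick a segment by weight, then uniformly within it."""
        batch = []
        for _ in range(batch_size):
            segment = rng.choices(self.segments, weights=self.weights, k=1)[0]
            batch.append(rng.choice(segment))
        return batch


# Usage with placeholder (state, action) tuples; the values are made up.
buffer = WeightedReplayBuffer(alpha=0.5)
buffer.add_policy_data([("s0", "a0"), ("s1", "a1")])
buffer.add_policy_data([("s2", "a2")])
buffer.add_policy_data([("s3", "a3")])
discriminator_batch = buffer.sample(8, rng=random.Random(0))
```

Sampling by segment weight rather than uniformly over all stored transitions is what distinguishes this from a standard replay buffer: the discriminator sees data in proportion to each policy's share of the ensemble, not in proportion to how much data each policy happened to collect.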

Paper Structure

This paper contains 24 sections, 7 equations, 7 figures, 4 tables, and 3 algorithms.

Figures (7)

  • Figure 1: Aggregate metrics on DMC environments with 95% confidence intervals (CIs), based on 5 environments spanning easy, medium, and hard tasks. Higher inter-quartile mean (IQM) and mean scores (right) and a lower optimality gap (left) are better. The CIs were calculated using the percentile bootstrap with stratified sampling over three random seeds, and all metrics are reported on expert-normalized scores. AILBoost outperforms DAC, ValueDICE, IQ-Learn, and BC across all metrics, amounts of expert demonstrations, and tasks.
  • Figure 2: Learning curves with 1 expert trajectory across 3 random seeds. AILBoost successfully imitates the expert on all environments where other baselines fail and achieves better sample complexity than DAC. As environment difficulty increases, our method shows a larger performance gap over the baselines (e.g., humanoid stand).
  • Figure 3: Image-based: performance on image-based DMC environments, Walker Walk and Cheetah Run, comparing AILBoost, DAC, and BC across three random seeds.
  • Figure 4: Policy and Discriminator Update Schedules: Learning curves for AILBoost on two representative DMC environments, Walker Walk and Ball in Cup Catch, when optimizing with varying policy and discriminator update schemes across 3 seeds.
  • Figure 5: Probability of improvement between all tested baselines and AILBoost.
  • ...and 2 more figures
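
The aggregate metrics named in the Figure 1 caption (IQM, optimality gap, and percentile-bootstrap CIs with stratified sampling) can be sketched as below. This is a rough illustration under stated assumptions: scores are expert-normalized so that 1.0 means expert-level performance, stratification means resampling seeds with replacement within each task, and all function names and example scores are hypothetical rather than taken from the paper.

```python
import random
import statistics


def iqm(scores):
    """Interquartile mean: average of the middle 50% of the sorted scores."""
    ordered = sorted(scores)
    cut = len(ordered) // 4  # drop the bottom and top quarter
    return statistics.fmean(ordered[cut:len(ordered) - cut])


def optimality_gap(scores, target=1.0):
    """Average shortfall below the target score (lower is better)."""
    return statistics.fmean(max(target - s, 0.0) for s in scores)


def stratified_bootstrap_ci(scores_by_task, metric, n_resamples=2000,
                            level=0.95, seed=0):
    """Percentile bootstrap CI, resampling runs within each task."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        pooled = []
        for runs in scores_by_task.values():
            pooled.extend(rng.choices(runs, k=len(runs)))
        estimates.append(metric(pooled))
    estimates.sort()
    lower = estimates[int((1.0 - level) / 2.0 * (n_resamples - 1))]
    upper = estimates[int((1.0 + level) / 2.0 * (n_resamples - 1))]
    return lower, upper


# Made-up normalized scores: 2 tasks, 3 seeds each (not results from the paper).
example_scores = {
    "task_a": [0.92, 0.97, 1.01],
    "task_b": [0.40, 0.55, 0.61],
}
all_runs = [s for runs in example_scores.values() for s in runs]
print(iqm(all_runs), optimality_gap(all_runs))
print(stratified_bootstrap_ci(example_scores, iqm))
```

Resampling within each task keeps every task represented in each bootstrap replicate, so the interval reflects seed-to-seed variability rather than which tasks happened to be drawn.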

Theorems & Definitions (1)

  • Remark 1