On Discovering Algorithms for Adversarial Imitation Learning

Shashank Reddy Chirra, Jayden Teoh, Praveen Paruchuri, Pradeep Varakantham

TL;DR

This work addresses the instability of Adversarial Imitation Learning by focusing on the reward assignment (RA) function rather than just density estimation. It proposes Discovered Adversarial Imitation Learning (DAIL), a meta-learned RA discovered via an LLM-guided evolutionary search, which yields RA functions that improve policy learning and training stability across unseen environments and optimizers. Empirical results on MuJoCo and MinAtar demonstrate that DAIL outperforms human-designed baselines (e.g., GAIL, AIRL, FAIRL) and generalizes to different policy learners, with analyses attributing stability to a bounded, informative RA signal that filters low-quality samples. The work also discusses limitations and future directions, including making RA signals time- and state-aware and enhancing reproducibility and safety through transparency and broader testing.

Abstract

Adversarial Imitation Learning (AIL) methods, while effective in settings with limited expert demonstrations, are often considered unstable. These approaches typically decompose into two components: Density Ratio (DR) estimation of $\frac{\rho_E}{\rho_\pi}$, where a discriminator estimates the relative occupancy of state-action pairs under the expert versus the policy; and Reward Assignment (RA), where this ratio is transformed into a reward signal used to train the policy. While significant research has focused on improving density estimation, the role of reward assignment in influencing training dynamics and final policy performance has been largely overlooked. RA functions in AIL are typically derived from divergence minimization objectives, relying heavily on human design and ingenuity. In this work, we take a different approach: we investigate the discovery of data-driven RA functions, i.e., functions selected directly on the basis of the performance of the resulting imitation policy. To this end, we leverage an LLM-guided evolutionary framework that efficiently explores the space of RA functions, yielding Discovered Adversarial Imitation Learning (DAIL), the first meta-learnt AIL algorithm. Remarkably, DAIL generalises across unseen environments and policy optimization algorithms, outperforming state-of-the-art human-designed baselines. Finally, we analyse why DAIL leads to more stable training, offering novel insights into the role of RA functions in the stability of AIL. Code is publicly available: https://github.com/shshnkreddy/DAIL.
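To make the DR/RA split concrete, the sketch below shows how a single discriminator output can be mapped to different rewards. It assumes the discriminator emits a logit that approximates $\log\frac{\rho_E}{\rho_\pi}$, and it uses the forms commonly cited in the literature for the human-designed baselines (GAIL, AIRL, FAIRL); other sign conventions exist. The function names and the `RA_FUNCTIONS` table are illustrative assumptions, not taken from the DAIL codebase, and the discovered DAIL function itself is not reproduced here.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gail_reward(logit: float) -> float:
    # One common GAIL convention: -log(1 - D) with D = sigmoid(logit),
    # which simplifies to softplus(logit).
    return softplus(logit)


def airl_reward(logit: float) -> float:
    # AIRL: log D - log(1 - D), which is the logit itself.
    return logit


def fairl_reward(logit: float) -> float:
    # FAIRL (forward-KL form): -logit * exp(logit).
    return -logit * math.exp(logit)


# Hypothetical lookup table: the RA step is just a choice of scalar map.
RA_FUNCTIONS = {
    "gail": gail_reward,
    "airl": airl_reward,
    "fairl": fairl_reward,
}


def assign_rewards(logits, name):
    # Apply the chosen RA function to each estimated log density ratio.
    ra = RA_FUNCTIONS[name]
    return [ra(l) for l in logits]


if __name__ == "__main__":
    logits = [-2.0, 0.0, 2.0]  # assumed discriminator outputs
    for name in RA_FUNCTIONS:
        print(name, assign_rewards(logits, name))
```

The discriminator (DR estimation) is untouched when the RA function changes; only the scalar map from log-ratio to reward differs, which is the space of functions the abstract describes searching over.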

Paper Structure

This paper contains 43 sections, 2 theorems, 30 equations, 8 figures, 8 tables, 2 algorithms.

Key Result

Lemma 1

For any scalar function $f: S \times A \to \mathbb{R}$ and discount factor $\gamma\in[0,1)$,

Figures (8)

  • Figure 1: Visualization of the different reward assignment functions.
  • Figure 2: Visualization of the LLM-guided evolution. The appendix contains the pseudocode of the framework.
  • Figure 3: Performance across generations on the MinAtar SpaceInvaders environment. We report the best-performing member per generation, with Generation $0$ denoting the base population. The $\mathcal{W}$ distance is normalized relative to the best base member (GAIL).
  • Figure 4: Aggregate performance on the Brax and MinAtar suites (excluding SpaceInvaders).
  • Figure 5: (Left) Probability of improvement of DAIL over baselines on Brax. (Right) Performance comparison between DAIL and GAIL (on MinAtar SpaceInvaders) using A2C. We report the mean and standard error of the mean (SEM).
  • ...and 3 more figures

Theorems & Definitions (4)

  • Lemma 1: Interchange of Expectations
  • proof
  • Lemma 2: Optimal Discriminator
  • proof