Mimicking Better by Matching the Approximate Action Distribution

João A. Cândido Ramos, Lionel Blondé, Naoya Takeishi, Alexandros Kalousis

TL;DR

MAAD tackles imitation learning from observations by coupling an on-policy learner with a learned inverse dynamics model to regularize action choices toward physically plausible options. The approach blends surrogate rewards from adversarial imitation, trajectory matching, or optimal transport with a KL-based regularizer that aligns the policy to the IDM posterior $p_{\alpha,\psi}(a|s,s')$, i.e., $L = L_{policy} + L_{reg}$ where $L_{reg} = \mathbb{E}_{(s,s')\sim \zeta} D_{KL}(p_{\alpha,\psi}(a|s,s') \,\|\, \pi_\theta(a|s))$. The IDM is trained as a mixture-density network and updated online from a replay buffer, enabling rapid adaptation without expert actions. Experiments on MuJoCo OpenAI Gym and the DeepMind Control Suite show substantial gains in sample efficiency and frequent expert-level performance, even when expert actions are unavailable, outperforming several state-of-the-art on-policy baselines. This work broadens the applicability of imitation learning to action-inaccessible data (e.g., videos) in continuous-control domains, offering a simple, effective path to robust IL from observations in robotics-like settings.

Abstract

In this paper, we introduce MAAD, a novel, sample-efficient on-policy algorithm for Imitation Learning from Observations. MAAD utilizes a surrogate reward signal, which can be derived from various sources such as adversarial games, trajectory matching objectives, or optimal transport criteria. To compensate for the unavailability of expert actions, we rely on an inverse dynamics model that infers a plausible action distribution given the expert's state-state transitions; we regularize the imitator's policy by aligning it to the inferred action distribution. MAAD leads to significantly improved sample efficiency and stability. We demonstrate its effectiveness in a number of MuJoCo environments, both in the OpenAI Gym and the DeepMind Control Suite. We show that it requires considerably fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods. Remarkably, MAAD often stands out as the sole method capable of attaining expert performance levels, underscoring its simplicity and efficacy.
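
To make the regularizer concrete, the sketch below estimates $D_{KL}(p \,\|\, \pi)$ by sampling, for a single transition $(s, s')$ and a one-dimensional action. It is an illustration under stated assumptions, not the authors' implementation: the IDM output is assumed to be a Gaussian mixture (as a mixture-density network would produce), the policy is assumed Gaussian, and every function name and parameter value here is hypothetical. The paper may estimate this term differently.

```python
import math
import random


def gaussian_logpdf(a, mean, std):
    """Log-density of a 1-D Gaussian at action a."""
    return -0.5 * ((a - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)


def mixture_logpdf(a, weights, means, stds):
    """Log-density of a 1-D Gaussian mixture, computed with log-sum-exp."""
    terms = [math.log(w) + gaussian_logpdf(a, m, s)
             for w, m, s in zip(weights, means, stds)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def sample_mixture(rng, weights, means, stds):
    """Draw one action: pick a component by weight, then sample from it."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], stds[k])


def kl_idm_to_policy(rng, idm, policy, n_samples=20000):
    """Monte Carlo estimate of KL(p_idm || pi) = E_{a ~ p_idm}[log p_idm(a) - log pi(a)].

    idm    : (weights, means, stds) of the assumed mixture for one (s, s') pair
    policy : (mean, std) of the assumed Gaussian policy at state s
    """
    weights, means, stds = idm
    pol_mean, pol_std = policy
    total = 0.0
    for _ in range(n_samples):
        a = sample_mixture(rng, weights, means, stds)
        total += mixture_logpdf(a, weights, means, stds) - gaussian_logpdf(a, pol_mean, pol_std)
    return total / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical bimodal IDM posterior: two actions explain the same transition.
    idm = ([0.5, 0.5], [-1.0, 1.0], [0.3, 0.3])
    print("policy centred between the modes:", kl_idm_to_policy(rng, idm, (0.0, 1.0)))
    print("policy far from both modes:      ", kl_idm_to_policy(rng, idm, (3.0, 1.0)))
```

In a full training loop this estimate would be averaged over transitions $(s, s') \sim \zeta$ and added to the policy loss, matching $L = L_{policy} + L_{reg}$ from the TL;DR. The direction of the KL matters: sampling from the IDM posterior penalizes a policy that assigns low density to actions the IDM considers plausible.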

Paper Structure (27 sections, 14 equations, 5 figures, 6 tables, 1 algorithm)

Figures (5)

  • Figure 1: Median Normalized Return, over different environments, of various instantiations of our method (solid lines) versus baselines (dashed curves). Methods marked with $\dag$ have access to expert actions and represent the best possible performance; all others do not. More details on the construction of the figure are given in Section \ref{sec:median_norm}.
  • Figure 2: Performance comparison between our proposed version of MAAD and the baselines (some of the baselines, highlighted here with $\dag$, have access to expert actions; see Section \ref{sec:baselines} for more information). We average the results over three random seeds and show the mean and the range of one standard deviation.
  • Figure 3: Interactions-based performance comparison of the different methods. Methods marked with $\dag$ have access to expert actions (see Section \ref{sec:baselines} for more information). We average the results over three random seeds and show the mean and the range of one standard deviation.
  • Figure 4: Computational time-based performance comparison of the different methods. Methods marked with $\dag$ have access to expert actions (see Section \ref{sec:baselines} for more information). We average the results over three random seeds and show the mean and the range of one standard deviation.
  • Figure 5: Median Normalized Return, over different environments, of various instantiations of our method (solid lines) versus baselines (dashed curves). This plot is derived by splitting the training curves shown in Fig. \ref{fig:results_inter_big} into a fixed number of quantiles, here 5000, and computing the median per algorithm across the different environments of each suite tested. Methods marked with $\dag$ have access to expert actions and represent the best possible performance; all others do not.
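
The aggregation described in the Figure 5 caption can be sketched as follows. This is one plausible reading of the caption, not the authors' procedure: each environment's training curve is resampled onto a fixed number of evenly spaced points over its own step range (the caption uses 5000), returns are normalized with an assumed `(return - random) / (expert - random)` rule, and the pointwise median is taken across environments. The normalization rule, the linear interpolation, and all names below are assumptions.

```python
from statistics import median


def resample_curve(steps, returns, n_points):
    """Linearly interpolate a curve onto n_points evenly spaced steps.

    Assumes steps is strictly increasing with at least two entries, n_points >= 2.
    """
    lo, hi = steps[0], steps[-1]
    out = []
    j = 0  # index of the segment [steps[j], steps[j + 1]] containing x
    for i in range(n_points):
        x = lo + (hi - lo) * i / (n_points - 1)
        while j + 1 < len(steps) - 1 and steps[j + 1] < x:
            j += 1
        x0, x1 = steps[j], steps[j + 1]
        t = (x - x0) / (x1 - x0)
        out.append(returns[j] + t * (returns[j + 1] - returns[j]))
    return out


def normalize(returns, random_score, expert_score):
    """Assumed normalization: 0 is random-policy level, 1 is expert level."""
    span = expert_score - random_score
    return [(r - random_score) / span for r in returns]


def median_normalized_curve(curves, scores, n_points):
    """Pointwise median across environments for one algorithm.

    curves : {env: (steps, returns)}
    scores : {env: (random_score, expert_score)}  # hypothetical reference scores
    """
    per_env = []
    for env, (steps, returns) in curves.items():
        random_score, expert_score = scores[env]
        resampled = resample_curve(steps, returns, n_points)
        per_env.append(normalize(resampled, random_score, expert_score))
    return [median(column) for column in zip(*per_env)]


if __name__ == "__main__":
    # Toy curves with made-up numbers, for illustration only.
    curves = {
        "env_a": ([0, 10], [0.0, 100.0]),
        "env_b": ([0, 20], [50.0, 50.0]),
        "env_c": ([0, 5], [0.0, 20.0]),
    }
    scores = {"env_a": (0.0, 100.0), "env_b": (0.0, 100.0), "env_c": (0.0, 20.0)}
    print(median_normalized_curve(curves, scores, n_points=3))
```

Resampling onto a common grid is what makes curves of different lengths comparable point by point; the median is then less sensitive than the mean to a single environment where a method fails or excels.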