Mimicking Better by Matching the Approximate Action Distribution
João A. Cândido Ramos, Lionel Blondé, Naoya Takeishi, Alexandros Kalousis
TL;DR
MAAD tackles imitation learning from observations by coupling an on-policy learner with a learned inverse dynamics model to regularize action choices toward physically plausible options. The approach blends surrogate rewards from adversarial imitation, trajectory matching, or optimal transport with a KL-based regularizer that aligns the policy to the IDM posterior $p_{\alpha,\psi}(a|s,s')$, i.e., $L = L_{\text{policy}} + L_{\text{reg}}$ where $L_{\text{reg}} = \mathbb{E}_{(s,s')\sim \zeta}\, D_{\mathrm{KL}}\big(p_{\alpha,\psi}(a|s,s') \,\|\, \pi_\theta(a|s)\big)$. The IDM is trained as a mixture-density network and updated online from a replay buffer, enabling rapid adaptation without expert actions. Experiments on MuJoCo OpenAI Gym and the DeepMind Control Suite show substantial gains in sample efficiency and frequent expert-level performance, even when expert actions are unavailable, outperforming several state-of-the-art on-policy baselines. This work broadens the applicability of imitation learning to action-inaccessible data (e.g., videos) in continuous-control domains, offering a simple, effective path to robust IL from observations in robotics-like settings.
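To make the regularized objective concrete, the following is a minimal sketch of the KL term and the combined loss. It is not the authors' implementation. It assumes, purely for illustration, that both the IDM posterior and the policy are diagonal Gaussians, so the KL divergence has a closed form; with the mixture-density IDM described here, no closed form exists and a sampled estimate would be needed instead. The function names `kl_diag_gaussians`, `kl_regularizer`, and `total_loss` are hypothetical.

```python
import math


def kl_diag_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL(p || q) for two diagonal Gaussians.

    Assumption: p stands in for the IDM posterior p(a | s, s') and q for the
    policy pi(a | s); each is given as per-dimension means and std deviations.
    """
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return total


def kl_regularizer(idm_batch, policy_batch):
    """Average the KL term over a batch of (s, s') transitions.

    Each batch entry is a (means, stds) pair for one transition.
    """
    kls = [
        kl_diag_gaussians(p_mu, p_sigma, q_mu, q_sigma)
        for (p_mu, p_sigma), (q_mu, q_sigma) in zip(idm_batch, policy_batch)
    ]
    return sum(kls) / len(kls)


def total_loss(policy_loss, reg_loss):
    """L = L_policy + L_reg, combined without any extra weighting."""
    return policy_loss + reg_loss


# Toy batch: the first transition matches exactly, the second differs in mean.
idm_batch = [([0.0], [1.0]), ([1.0], [1.0])]
policy_batch = [([0.0], [1.0]), ([0.0], [1.0])]
reg = kl_regularizer(idm_batch, policy_batch)
loss = total_loss(1.0, reg)
```

In the toy batch, the first transition contributes zero because the two distributions coincide, so the regularizer only penalizes the policy where it disagrees with the inferred action distribution.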
Abstract
In this paper, we introduce MAAD, a novel, sample-efficient on-policy algorithm for Imitation Learning from Observations. MAAD utilizes a surrogate reward signal, which can be derived from various sources such as adversarial games, trajectory matching objectives, or optimal transport criteria. To compensate for the unavailability of expert actions, we rely on an inverse dynamics model that infers a plausible action distribution given the expert's state-state transitions; we regularize the imitator's policy by aligning it to the inferred action distribution. MAAD leads to significantly improved sample efficiency and stability. We demonstrate its effectiveness in a number of MuJoCo environments, both in the OpenAI Gym and the DeepMind Control Suite. We show that it requires considerably fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods. Remarkably, MAAD often stands out as the sole method capable of attaining expert performance levels, underscoring its simplicity and efficacy.
