
EvIL: Evolution Strategies for Generalisable Imitation Learning

Silvia Sapora, Gokul Swamy, Chris Lu, Yee Whye Teh, Jakob Nicolaus Foerster

TL;DR

The paper tackles the difficulty of transferring imitation policies when the training and deployment environments differ, focusing on the poorly shaped rewards produced by modern IRL methods. It introduces IRL++, a set of practical adjustments (reward model ensembles, policy buffers, and random resets) that improve retraining from recovered rewards, and EvIL, an ES-based framework that optimises a potential-based shaping term to speed up retraining in target environments. The authors demonstrate that IRL++ enables effective retraining, while EvIL significantly improves interaction efficiency and transfer performance on continuous control tasks in MuJoCo (Hopper, Walker, Ant). The work offers a scalable, simulator-friendly approach to producing generalisable imitation policies that retrain more reliably and transfer in practice.

Abstract

Often in imitation learning (IL), the environment in which we collect expert demonstrations and the environment in which we want to deploy our learned policy are not exactly the same (e.g. demonstrations collected in simulation but deployment in the real world). Compared to policy-centric approaches to IL like behavioural cloning, reward-centric approaches like inverse reinforcement learning (IRL) often better replicate expert behaviour in new environments. This transfer is usually performed by optimising the recovered reward under the dynamics of the target environment. However, (a) we find that modern deep IL algorithms frequently recover rewards which induce policies far weaker than the expert, even in the same environment the demonstrations were collected in. Furthermore, (b) these rewards are often quite poorly shaped, necessitating extensive environment interaction to optimise effectively. We provide simple and scalable fixes to both of these concerns. For (a), we find that reward model ensembles combined with a slightly different training objective significantly improve re-training and transfer performance. For (b), we propose a novel evolution-strategies-based method, EvIL, to optimise for a reward-shaping term that speeds up re-training in the target environment, closing a gap left open by the classical theory of IRL. On a suite of continuous control tasks, we are able to re-train policies in target (and source) environments more interaction-efficiently than prior work.
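
To make point (b) more concrete, the following is a minimal sketch of how an evolution-strategies loop could tune the parameters of a shaping term. It is not the authors' implementation. The shaping rule is the standard potential-based form, r + γΦ(s') − Φ(s). The function names (`shaped_reward`, `es_step`, `toy_fitness`), the antithetic sampling scheme, and all hyperparameters are assumptions made for illustration. In particular, `toy_fitness` is a hypothetical stand-in for the real objective, which would be the return of a policy retrained for a limited interaction budget on the shaped reward.

```python
import numpy as np

def shaped_reward(r, phi_s, phi_next, gamma=0.99):
    """Standard potential-based shaping: r + gamma * Phi(s') - Phi(s)."""
    return r + gamma * phi_next - phi_s

def es_step(theta, fitness_fn, rng, pop_size=32, sigma=0.1, lr=0.05):
    """One evolution-strategies update that moves theta towards higher fitness."""
    half = rng.standard_normal((pop_size // 2, theta.size))
    noise = np.concatenate([half, -half])  # antithetic (mirrored) perturbations
    fitness = np.array([fitness_fn(theta + sigma * n) for n in noise])
    fitness = fitness - fitness.mean()     # centre the fitness values
    # Monte Carlo estimate of the gradient of the smoothed fitness.
    grad = noise.T @ fitness / (len(noise) * sigma)
    return theta + lr * grad

def toy_fitness(theta):
    # Hypothetical placeholder objective (assumption): a quadratic bowl.
    # In the setting described above this would instead score how well a
    # policy performs after a short retraining run on the shaped reward.
    target = np.array([1.0, -2.0, 0.5])
    return -np.sum((theta - target) ** 2)

rng = np.random.default_rng(0)
theta = np.zeros(3)  # stands in for the parameters of the potential Phi
for _ in range(300):
    theta = es_step(theta, toy_fitness, rng)
```

Because ES only needs fitness evaluations, the inner retraining run does not have to be differentiable, which is one reason a black-box optimiser is a natural fit for this kind of outer loop.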

Paper Structure (21 sections, 5 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: A long-standing concern with standard inverse reinforcement learning (IRL) algorithms is that they can recover poorly shaped reward functions, as the moment-matching loss is invariant to potential-based shaping terms (Ng and Russell, 2000). We propose using evolution strategies to learn a shaping term on top of the reward recovered by inverse RL by directly optimising for efficient re-training. We are able to re-train policies in both source and unseen target environments more sample-efficiently than prior work.
  • Figure 2: Shaping RL. Our method successfully recovers a potential-based reward function that helps an RL agent learn faster. We compare against two baselines: using the expert value function as a shaping term when retraining, and non-shaped RL training on the real reward. We compute standard error across 5 seeds. $J(\pi)$ is the performance of the learner under the ground truth reward.
  • Figure 3: Shaping IRL. We show our method can successfully recover a potential-based shaping term that makes the recovered reward function easier to optimise. We use the shaping term combined with a reward recovered from IRL++. We compare against three baselines: the reward recovered from an ensemble of discriminators without shaping; the reward recovered by a classic IRL method; and the IRL++ reward shaped using the expert value function when retraining. For each, we train on 5 seeds, with shading representing standard error.
  • Figure 4: EvIL Transfer on Trembling Hand Environment. EvIL outperforms both BC and IRL++ on transfer to an environment where, with probability $\epsilon$, a random action is executed in the environment rather than the one the agent selected. IRL++ outperforms BC, highlighting the importance of interactive training for effective transfer.
  • Figure 5: EvIL Transfer on Randomised Dynamics Environment. EvIL outperforms both BC and IRL++ on transfer to an environment where link lengths and joint ranges are randomly sampled and differ from the demonstrations. As before, IRL++ outperforms BC.
  • ...and 3 more figures
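
The trembling-hand perturbation described in the Figure 4 caption can be sketched as a small action filter. This is an illustrative sketch only. The caption says just that a random action is executed with probability $\epsilon$, so the function name `trembling_hand_action`, the uniform sampling distribution, and the `low`/`high` action bounds are all assumptions rather than details taken from the paper.

```python
import numpy as np

def trembling_hand_action(action, epsilon, low, high, rng):
    """Return the agent's action, or a random one with probability epsilon.

    Assumption: the random action is drawn uniformly from [low, high].
    """
    action = np.asarray(action, dtype=float)
    if rng.random() < epsilon:
        # The "hand trembles": ignore the agent's choice entirely.
        return rng.uniform(low, high, size=action.shape)
    return action

rng = np.random.default_rng(0)
agent_action = np.array([0.3, -0.2])
executed = trembling_hand_action(agent_action, epsilon=0.1, low=-1.0, high=1.0, rng=rng)
```

In use, a filter like this would sit between the policy and the environment's step function, so the policy is evaluated under dynamics it did not see in the demonstrations.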