DITTO: Offline Imitation Learning with World Models

Branton DeMoss, Paul Duckworth, Jakob Foerster, Nick Hawes, Ingmar Posner

TL;DR

DITTO introduces Dream Imitation, an offline imitation learning algorithm that leverages a learned world model to perform policy learning in latent space. By optimizing a multi-step latent-space divergence from expert trajectories and using an intrinsic reward based on latent similarity, DITTO casts the imitation problem as online RL within the world model, avoiding online environment access and adversarial rewards. Theoretical connections show that minimizing latent divergence bounds the true return gap in the real environment, and empirically DITTO achieves state-of-the-art, data-efficient imitation on pixel-based Atari benchmarks, outperforming BC and adversarial baselines while remaining robust to covariate shift. The approach demonstrates how world models can enable scalable, offline, high-dimensional imitation learning with practical impact for real-world deployment.

Abstract

For imitation learning algorithms to scale to real-world challenges, they must handle high-dimensional observations, offline learning, and policy-induced covariate shift. We propose DITTO, an offline imitation learning algorithm which addresses all three of these problems. DITTO optimizes a novel distance metric in the latent space of a learned world model: first, we train a world model on all available trajectory data; then, the imitation agent is unrolled from expert start states in the learned model and penalized for its latent divergence from the expert dataset over multiple time steps. We optimize this multi-step latent divergence using standard reinforcement learning algorithms, which provably induces imitation learning, and empirically achieves state-of-the-art performance and sample efficiency on a range of Atari environments from pixels, without any online environment access. We also adapt other standard imitation learning algorithms to the world model setting, and show that this considerably improves their performance. Our results show how creative use of world models can lead to a simple, robust, and highly performant policy-learning framework.
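The unroll-and-penalize loop described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_step` is a hand-written additive stand-in for a learned latent dynamics model, and `latent_similarity` is one plausible normalized dot-product similarity, not necessarily the intrinsic reward used in the paper (which this summary does not reproduce). The reinforcement learning step that would maximize the summed reward is omitted.

```python
def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def latent_similarity(z_expert, z_learner):
    """Hypothetical intrinsic reward: dot product divided by the larger squared norm.
    It is 1 when the two latents coincide and never exceeds 1."""
    denom = max(dot(z_expert, z_expert), dot(z_learner, z_learner))
    return dot(z_expert, z_learner) / denom if denom > 0 else 1.0

def toy_step(z, action):
    """Stand-in for a learned world model (assumption: deterministic, additive)."""
    return [zi + ai for zi, ai in zip(z, action)]

def rollout_rewards(policy, expert_latents, start, horizon):
    """Unroll `policy` from an expert latent state and score every step
    against the expert latent at the same time index."""
    z = list(expert_latents[start])
    rewards = []
    for t in range(1, horizon + 1):
        if start + t >= len(expert_latents):
            break  # the expert trajectory ended
        z = toy_step(z, policy(z))
        rewards.append(latent_similarity(expert_latents[start + t], z))
    return rewards

# Toy expert trajectory: constant action (1.0, 0.5) starting from latent (1.0, 1.0).
expert_latents = [[1.0 + t, 1.0 + 0.5 * t] for t in range(6)]

expert_policy = lambda z: [1.0, 0.5]   # reproduces the expert exactly
drift_policy = lambda z: [0.5, 1.0]    # moves away from the expert path

expert_rewards = rollout_rewards(expert_policy, expert_latents, start=0, horizon=5)
drift_rewards = rollout_rewards(drift_policy, expert_latents, start=0, horizon=5)

print(sum(expert_rewards), sum(drift_rewards))
```

In this toy setting a policy that tracks the expert collects the maximum summed reward, while a drifting policy is penalized at every step of the rollout, which is the multi-step signal an RL algorithm would then optimize.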

Paper Structure

This paper contains 16 sections, 1 theorem, 20 equations, 4 figures, 1 table, 1 algorithm.

Key Result

Corollary A.1

Suppose we also have another imitation learner, which uses the same dataset of size $N$, and still satisfies Assumption 3, but instead trains on some other intrinsic reward, $R_{\text{int}}'$, which satisfies (for some $\epsilon > 0$): Let $\rho^J$ be the limiting state-action distribution of this imitation learner. Then:

Figures (4)

  • Figure 1: The learner begins from random expert latent states during training, and generates on-policy latent trajectories in the world model. The intrinsic reward encourages the learner to recover from its mistakes over multiple time steps to match the expert trajectory.
  • Figure 2: We compare mean extrinsic reward from rollouts in the true environment (BeamRider) throughout agent training (left) to agents' mean latent distance from the expert (center), and mean expert action prediction accuracy (right). Both latent distance and accuracy are calculated on held-out expert trajectories used for validation. Latent distance is defined as $L_d=1-r_{int}$. DITTO explicitly minimizes this quantity, and achieves the greatest generalization performance in the true environment. Perfect agreement with the expert would result in $L_d = 0$, but this is impossible to achieve since the world model is stochastic. Counter-intuitively, expert action prediction accuracy is negatively correlated with generalization performance in the true environment.
  • Figure 3: Results on five Atari environments from pixels, with fixed horizon $H=15$. In all environments, DITTO matches or exceeds expert performance, and matches or exceeds all baselines. In MsPacman and Qbert, all model-based methods immediately recover expert performance with minimal data. In MsPacman, we observe adversarial collapse of D-GAIL. We follow Agarwal et al. (2021) for offline policy evaluation, and report the mean reward achieved across 10 gradient steps with 20 validation simulations, to avoid lottery-ticket policy results. Shaded regions show $\pm 1$ standard error. The experts are strong pre-trained PPO agents from the RL Baselines3 Zoo.
  • Figure 4: Left: Results on continuous control environment BipedalWalker, from pixels. Right: Training time horizon ablation. Note that both DITTO and D-GAIL achieve their maximum performance at a similar training time horizon. We conjecture that this hyperparameter is environment-specific, and report results for all environments with fixed $H=15$.
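The two summary quantities named in the captions, the latent distance $L_d = 1 - r_{int}$ and the $\pm 1$ standard error band, can be computed as in the sketch below. The reward values are made up for illustration, and the use of the sample standard deviation (with $n-1$ in the denominator) is an assumption, since the captions do not say which estimator was used.

```python
import math
import statistics

def latent_distance(r_int):
    """Latent distance as defined in the Figure 2 caption: L_d = 1 - r_int."""
    return 1.0 - r_int

def mean_and_standard_error(values):
    """Mean and standard error of the mean.
    Assumption: sample standard deviation (n - 1 denominator)."""
    mean = statistics.mean(values)
    if len(values) < 2:
        return mean, 0.0
    return mean, statistics.stdev(values) / math.sqrt(len(values))

# Hypothetical intrinsic rewards from three held-out validation steps.
example_rewards = [0.9, 0.8, 1.0]
distances = [latent_distance(r) for r in example_rewards]
mean_ld, sem_ld = mean_and_standard_error(distances)

print(f"L_d = {mean_ld:.3f} +/- {sem_ld:.3f}")
```

A shaded region like the one in Figure 3 would then span `mean - sem` to `mean + sem` at each evaluation point.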

Theorems & Definitions (2)

  • Corollary A.1
  • proof