Reward-free World Models for Online Imitation Learning

Shangzhe Li, Zhiao Huang, Hao Su

TL;DR

This work tackles online imitation learning in complex, high-dimensional environments by introducing reward-free world models that learn latent dynamics without reconstruction. It reframes optimization in the $Q$-policy space via an inverse soft-$Q$ objective, enabling stable training without explicit reward modeling and enabling planning with MPC over latent trajectories. The proposed IQ-MPC framework uses separate expert and behavioral replay buffers, a prediction-consistency loss, and a gradient-penalized inverse soft-$Q$ objective to learn a robust latent world model and policy prior. Empirically, IQ-MPC achieves stable, expert-level performance on DMControl, MyoSuite, and ManiSkill2, including visual observation tasks, and ablations confirm the importance of objective formulation, gradient penalties, and planning-based control for performance and stability.

Abstract

Imitation learning (IL) enables agents to acquire skills directly from expert demonstrations, providing a compelling alternative to reinforcement learning. However, prior online IL approaches struggle with complex tasks characterized by high-dimensional inputs and complex dynamics. In this work, we propose a novel approach to online imitation learning that leverages reward-free world models. Our method learns environmental dynamics entirely in latent spaces without reconstruction, enabling efficient and accurate modeling. We adopt the inverse soft-Q learning objective, reformulating the optimization process in the Q-policy space to mitigate the instability associated with traditional optimization in the reward-policy space. By employing a learned latent dynamics model and planning for control, our approach consistently achieves stable, expert-level performance in tasks with high-dimensional observation or action spaces and intricate dynamics. We evaluate our method on a diverse set of benchmarks, including DMControl, MyoSuite, and ManiSkill2, demonstrating superior empirical performance compared to existing approaches.
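Working in the Q-policy space means that a single Q-function implicitly defines a reward, so no separate reward model is trained. The sketch below illustrates this with the inverse soft Bellman operator from inverse soft-Q learning, $r(s,a) = Q(s,a) - \gamma V(s')$. It assumes a small discrete action set so that the soft value has the closed form $V(s) = \log \sum_a \exp Q(s,a)$. The function names and the discrete setting are illustrative assumptions, not the paper's implementation, which operates on latent states with continuous actions.

```python
import numpy as np


def soft_value(q_values):
    """Soft value V(s) = log sum_a exp Q(s, a) over a discrete action set.

    q_values has shape (batch, num_actions). The maximum is subtracted
    before exponentiating for numerical stability.
    """
    q_max = q_values.max(axis=-1, keepdims=True)
    log_sum = np.log(np.exp(q_values - q_max).sum(axis=-1, keepdims=True))
    return (q_max + log_sum).squeeze(-1)


def decode_reward(q_sa, q_next, gamma=0.99):
    """Inverse soft Bellman operator: r(s, a) = Q(s, a) - gamma * V(s').

    q_sa   : shape (batch,), Q-value of the action actually taken.
    q_next : shape (batch, num_actions), Q-values at the next state.
    """
    return q_sa - gamma * soft_value(q_next)


# Hypothetical toy values: one transition, two actions at the next state.
reward = decode_reward(np.array([1.0]), np.array([[0.0, 0.0]]), gamma=0.5)
```

With continuous actions the log-sum-exp is not available in closed form, and the soft value is typically estimated from policy samples instead.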

Paper Structure

This paper contains 47 sections, 3 theorems, 22 equations, 18 figures, 7 tables, 2 algorithms.

Key Result

Lemma 4.1

Given an unknown latent MDP $\mathcal{M}$ and our learned latent MDP $\hat{\mathcal{M}}$ with transition probabilities $d$ and $\hat{d}$ in the latent state space $\mathcal{Z}$ and action space $\mathcal{A}$, and letting $R_{\max}$ denote the maximum reward of the unknown MDP, the difference between

Figures (18)

  • Figure 1: IQ-MPC. We illustrate the training workflow for IQ-MPC. The reward-free world model is trained on both expert and behavioral data, using the objectives in Section \ref{sec:learning-process}. The policy prior from the world model guides the MPPI planning process, together with rewards decoded from Q estimates. The planning process is detailed in Algorithm \ref{alg:inference}.
  • Figure 2: Locomotion Results. Our method demonstrates much more stable performance near expert level than the baseline methods. In the plots, blue lines refer to the online version of IQL+SAC (garg2021iq), orange lines to the HyPE method (ren2024hybrid), purple lines to the CFIL+SAC baseline (freund2023coupled), and red lines to our IQ-MPC model. The dotted green lines are the mean episode reward of the expert trajectories used during training.
  • Figure 3: Manipulation Results in MyoSuite. Our IQ-MPC achieves stable results and outperforms the baselines in MyoSuite manipulation experiments with dexterous hands. The color scheme is the same as in Figure \ref{fig:locomotion-results}. In the Pen Twirl task, the CFIL+SAC agent is unable to train after 20K time steps, so we fill in the remaining time steps with a straight line in the plot.
  • Figure 4: Results for Visual Experiments. Our IQ-MPC (red lines) shows stable, expert-level results in visual observation tasks. In the plots, IQL+SAC with an additional convolutional encoder is denoted IQL+SAC (Visual) (blue lines). Our model outperforms IQL+SAC (Visual) in the Cheetah Run and Walker Run tasks and performs comparably in the Walker Walk task. The expert trajectories used for training are sampled from TD-MPC2 trained on visual observations.
  • Figure 5: Ablation on Expert Trajectory Numbers. Performance of IQ-MPC with varying numbers of expert trajectories. Stable expert-level performance is achieved with only 10 expert demonstrations for Hopper Hop (top) and 5 for Object Hold (bottom).
  • ...and 13 more figures
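
The Figure 1 caption describes MPPI planning over latent trajectories with rewards decoded from Q estimates. The sketch below is a simplified, sampling-based planner in that spirit. It samples action sequences, rolls them out through a dynamics function, weights them by exponentiated return, and refits the sampling distribution. The toy scalar dynamics and reward are hypothetical stand-ins for the learned latent model and the Q-decoded reward. The policy prior and any terminal value estimate are omitted, so this is not the paper's Algorithm.

```python
import numpy as np


def mppi_plan(z0, dynamics, reward, horizon=5, samples=256, iters=4,
              temperature=0.5, seed=0):
    """Return the first action of a plan found by a simplified MPPI loop.

    dynamics(z, a) -> next state and reward(z, a, z_next) -> reward are
    vectorized over a batch of scalar states and scalar actions.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros(horizon)  # mean of the action-sequence distribution
    std = np.ones(horizon)    # per-step standard deviation
    for _ in range(iters):
        # Sample candidate action sequences, shape (samples, horizon).
        actions = mean + std * rng.standard_normal((samples, horizon))
        z = np.full(samples, z0, dtype=float)
        returns = np.zeros(samples)
        # Roll every candidate out through the model and sum rewards.
        for t in range(horizon):
            z_next = dynamics(z, actions[:, t])
            returns += reward(z, actions[:, t], z_next)
            z = z_next
        # Exponentiated-return weights, shifted by the max for stability.
        weights = np.exp((returns - returns.max()) / temperature)
        weights /= weights.sum()
        # Refit the sampling distribution to the weighted candidates.
        mean = weights @ actions
        std = np.sqrt(weights @ (actions - mean) ** 2) + 1e-3
    return mean[0]


# Hypothetical toy problem: drive a scalar state from 1.0 toward 0.
first_action = mppi_plan(
    z0=1.0,
    dynamics=lambda z, a: z + a,
    reward=lambda z, a, z_next: -z_next ** 2,
)
```

In a receding-horizon controller only the first action is executed, and planning is repeated from the next observed state.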

Theorems & Definitions (7)

  • Lemma 4.1: Bounded Suboptimality
  • Definition 8.1
  • Definition 8.2
  • Lemma 8.3: Objective Equivalence
  • proof
  • Theorem 8.4: Policy Update
  • proof