Reward-free World Models for Online Imitation Learning
Shangzhe Li, Zhiao Huang, Hao Su
TL;DR
This work tackles online imitation learning in complex, high-dimensional environments by introducing reward-free world models that learn latent dynamics without reconstruction. It reframes optimization in the $Q$-policy space via an inverse soft-$Q$ objective, which allows stable training without explicit reward modeling and supports planning with model predictive control (MPC) over latent trajectories. The proposed IQ-MPC framework uses separate expert and behavioral replay buffers, a prediction-consistency loss, and a gradient-penalized inverse soft-$Q$ objective to learn a robust latent world model and policy prior. Empirically, IQ-MPC achieves stable, expert-level performance on DMControl, MyoSuite, and ManiSkill2, including tasks with visual observations. Ablations confirm that the objective formulation, the gradient penalty, and planning-based control each contribute to performance and stability.
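For orientation, the inverse soft-$Q$ objective is commonly written in the inverse soft-Q learning (IQ-Learn) literature as below. This is a sketch of that general form, not the exact loss used by IQ-MPC: the gradient penalty and the latent-space formulation are not reproduced here. The symbols $\rho_E$ (expert occupancy), $\rho_0$ (initial-state distribution), $\gamma$ (discount factor), and $\phi$ (a concave function determined by the chosen reward regularizer) are notation assumed for this illustration.

```latex
\max_{Q}\,\min_{\pi}\; J(\pi, Q)
  = \mathbb{E}_{(s,a)\sim\rho_E}\!\left[\phi\!\left(Q(s,a) - \gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\,V^{\pi}(s')\right)\right]
  - (1-\gamma)\,\mathbb{E}_{s_0\sim\rho_0}\!\left[V^{\pi}(s_0)\right],
\qquad
V^{\pi}(s) = \mathbb{E}_{a\sim\pi(\cdot\mid s)}\!\left[Q(s,a) - \log\pi(a\mid s)\right].
```

Under this form, no separate reward network is trained: a reward can be read off the critic as $r(s,a) = Q(s,a) - \gamma\,\mathbb{E}_{s'}V^{\pi}(s')$, which is what makes the approach reward-free.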
Abstract
Imitation learning (IL) enables agents to acquire skills directly from expert demonstrations, providing a compelling alternative to reinforcement learning. However, prior online IL approaches struggle on tasks with high-dimensional inputs and complex dynamics. In this work, we propose a novel approach to online imitation learning that leverages reward-free world models. Our method learns environmental dynamics entirely in latent space without reconstruction, enabling efficient and accurate modeling. We adopt the inverse soft-Q learning objective, reformulating the optimization process in the Q-policy space to mitigate the instability associated with traditional optimization in the reward-policy space. By employing a learned latent dynamics model and planning for control, our approach consistently achieves stable, expert-level performance in tasks with high-dimensional observation or action spaces and intricate dynamics. We evaluate our method on a diverse set of benchmarks, including DMControl, MyoSuite, and ManiSkill2, demonstrating superior empirical performance compared to existing approaches.
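To make "planning for control" with a reward-free latent model concrete, the following is a minimal sketch under stated assumptions. It uses plain random-shooting MPC as a stand-in for whatever planner the method actually uses, and it decodes the reward from the critic as r = Q(z, a) - gamma * V(z'), the standard reward recovery in inverse soft-Q learning. The function `plan_random_shooting` and the toy `toy_dynamics`, `toy_q`, and `toy_v` models are hypothetical names invented for this illustration; in practice they would be replaced by learned networks.

```python
import numpy as np

def plan_random_shooting(z0, dynamics, q_fn, v_fn, horizon=3,
                         num_samples=512, gamma=0.99, action_dim=2, rng=None):
    """Pick the first action of the best sampled latent trajectory.

    Illustrative random-shooting planner (an assumption, not the paper's planner).
    dynamics(z, a), q_fn(z, a), and v_fn(z) operate on batches.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Candidate action sequences, uniform in [-1, 1]:
    # shape (num_samples, horizon, action_dim).
    actions = rng.uniform(-1.0, 1.0, size=(num_samples, horizon, action_dim))
    z = np.repeat(z0[None, :], num_samples, axis=0)
    returns = np.zeros(num_samples)
    discount = 1.0
    for t in range(horizon):
        a = actions[:, t]
        z_next = dynamics(z, a)
        # No reward head: the reward is decoded from the critic,
        # r = Q(z, a) - gamma * V(z').
        reward = q_fn(z, a) - gamma * v_fn(z_next)
        returns += discount * reward
        discount *= gamma
        z = z_next
    # Bootstrap the tail of the trajectory with the terminal value.
    returns += discount * v_fn(z)
    return actions[int(np.argmax(returns)), 0]

# Toy stand-ins for learned components (hypothetical, for illustration only).
GOAL = np.array([0.5, -0.3])

def toy_dynamics(z, a):
    return z + a  # latent state moves by the action

def toy_v(z):
    return -np.sum((z - GOAL) ** 2, axis=-1)  # higher value closer to GOAL

def toy_q(z, a):
    return toy_v(toy_dynamics(z, a))  # value of the state reached by taking a
```

Calling `plan_random_shooting(np.zeros(2), toy_dynamics, toy_q, toy_v, horizon=1)` returns a two-dimensional action that moves the toy latent state toward `GOAL`. In an MPC loop, only this first action is executed and planning is repeated from the next encoded observation.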
