Imagine-2-Drive: Leveraging High-Fidelity World Models via Multi-Modal Diffusion Policies

Anant Garg, K Madhava Krishna

TL;DR

Imagine-2-Drive tackles long-horizon planning in autonomous driving by uniting a high-fidelity diffusion world model (DiffDreamer) with a multi-modal diffusion policy (DPA). DiffDreamer uses Stable Video Diffusion to predict $H$ future observations and rewards from $P$ past frames, which reduces error accumulation. DPA samples diverse trajectory sequences via diffusion and is trained with PPO inside the world model's imagination. Iterative training of DiffDreamer and DPA in CARLA Town04 yields higher route completion and success rate and fewer infractions, alongside markedly better prediction fidelity (FID/FVD) than prior models. This framework advances sample-efficient, multi-modal planning for autonomous driving and potentially broader robotics applications.

Abstract

World Model-based Reinforcement Learning (WMRL) enables sample-efficient policy learning by reducing the need for online interactions, which can be costly and unsafe, especially for autonomous driving. However, existing world models often suffer from low prediction fidelity and compounding one-step errors, leading to policy degradation over long horizons. Additionally, traditional RL policies, which are often deterministic or based on a single Gaussian, fail to capture the multi-modal nature of decision-making in complex driving scenarios. To address these challenges, we propose Imagine-2-Drive, a novel WMRL framework that integrates a high-fidelity world model with a multi-modal diffusion-based policy actor. It consists of two key components: DiffDreamer, a diffusion-based world model that generates future observations simultaneously, mitigating error accumulation, and DPA (Diffusion Policy Actor), a diffusion-based policy that models diverse and multi-modal trajectory distributions. By training DPA within DiffDreamer, our method enables robust policy learning with minimal online interactions. We evaluate our method in CARLA using standard driving benchmarks and demonstrate that it outperforms prior world model baselines, improving Route Completion and Success Rate by 15% and 20%, respectively.

Paper Structure

This paper contains 28 sections, 17 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Based on past and current observations, Imagine-2-Drive generates a waypoint trajectory using DPA. This trajectory, along with observations, is used to generate future observations and rewards from DiffDreamer corresponding to the trajectory.
  • Figure 2: Architecture: Imagine-2-Drive consists of a Diffusion Policy Actor (DPA) $\pi_{\theta}$ for trajectory prediction $\tau$ and DiffDreamer $\mathcal{M}_{\phi}$ as a World Model for future state and reward prediction. (a) illustrates the overall pipeline: given the encoded state from the current and $P$ past observations, $\pi_{\theta}$ denoises a set of one-hot embeddings over $K$ steps to generate $H$ future discrete actions, forming the final denoised trajectory $\tau_{t}^{0}$. This trajectory is further enriched using Fourier Embeddings and, along with past and current observations, is input to $\mathcal{M}_{\phi}$ to predict future $H$ observations and rewards. (b) details DiffDreamer, comprising two components: SVD for future observation prediction and an additional head for reward prediction. In SVD, the first $P$ noisy frames are replaced with past observations, while the current observation is repeated $(P+H)$ times and concatenated with past and noisy frames for better grounding with the initial conditions.
  • Figure 3: Iterative Training of DPA ($\pi_{\theta}$) and DiffDreamer ($\mathcal{M}_{\phi}$): $\pi_{\theta}$ and $\mathcal{M}_{\phi}$ are trained alternately, with one fixed while the other updates. $\mathcal{M}_{\phi}$ learns from $\pi_{\theta}$ rollouts in the simulator, while $\pi_{\theta}$ is optimized using a frozen $\mathcal{M}_{\phi}$. This iterative process ensures synchronization and improves training stability.
  • Figure 4: Multi-Modal Nature of DPA: The top-view visualization highlights the diverse behaviors generated over the episode by DPA under identical initial conditions but different seeds. Given the current and past observations as shown, DiffDreamer predicts future observations for two distinct trajectories: Left (Blue) and Right (Red), navigating around the Green car ahead. The predicted frames are color-coded to match the corresponding trajectory. This demonstrates both the multi-modal nature of DPA and the prediction fidelity of DiffDreamer. (Please zoom in for a better view)
  • Figure 5: DiffDreamer Future Prediction: Future observation predictions from DiffDreamer, conditioned on the input trajectory (shown in black dots $\cdots$) and current observations. This demonstrates our world model's ability to accurately predict future observations based on the provided context, highlighting its robust trajectory prediction capabilities. (Please zoom in for a better view)
  • ...and 2 more figures
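
The input layout described in the Figure 2(b) caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_svd_input`, the use of flat Python lists as stand-ins for image or latent tensors, and the choice to model "concatenated" as a per-frame (channel-like) concatenation are all assumptions made for the example.

```python
import random


def build_svd_input(past_frames, current_frame, horizon, seed=0):
    """Assemble a (P + H)-frame input following the Figure 2(b) caption.

    Hypothetical helper: frames are flat lists of floats here, whereas a
    real model would operate on image or latent tensors.
    """
    rng = random.Random(seed)
    p = len(past_frames)          # P: number of past observations
    dim = len(current_frame)      # per-frame feature size

    # Start with Gaussian noise in all P + H frame slots.
    noisy = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(p + horizon)]

    # Replace the first P noisy frames with the known past observations.
    noisy[:p] = [list(frame) for frame in past_frames]

    # Repeat the current observation (P + H) times as conditioning.
    cond = [list(current_frame) for _ in range(p + horizon)]

    # Concatenate each frame with its conditioning copy
    # (assumed stand-in for a channel-wise concatenation).
    return [frame + c for frame, c in zip(noisy, cond)]


# Usage: P = 2 past frames, H = 3 future frames, 2 features per frame.
past = [[1.0, 1.0], [2.0, 2.0]]
current = [3.0, 3.0]
frames = build_svd_input(past, current, horizon=3)
print(len(frames), len(frames[0]))
```

In this sketch the first $P$ slots carry ground-truth context and only the last $H$ slots remain noise, so all $H$ future frames would be denoised jointly instead of being rolled out one step at a time.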