ProphetDWM: A Driving World Model for Rolling Out Future Actions and Videos

Xiaodong Wang, Peixi Peng

TL;DR

ProphetDWM is a novel end-to-end driving world model that jointly predicts future videos and actions. Compared with state-of-the-art methods, it achieves the best video consistency and the best action prediction accuracy, while also enabling high-quality long-term video and action generation.

Abstract

Real-world driving requires people to observe the current environment, anticipate the future, and make appropriate driving decisions. This requirement aligns well with the capabilities of world models, which understand the environment and predict the future. However, recent world models in autonomous driving are built explicitly: they predict the future through controllable driving video generation. We argue that driving world models should have two additional abilities: action control and action prediction. Measured against this view, previous methods are limited: to predict a video they require given actions of the same length as the video, and they ignore the dynamic laws that govern actions. To address these issues, we propose ProphetDWM, a novel end-to-end driving world model that jointly predicts future videos and actions. Our world model has an action module that learns latent actions from the present to the future period, given the action sequence and observations, and a diffusion-model-based transition module that learns the state distribution. The model is jointly trained by learning latent actions given finite states and by predicting actions and video. This joint learning connects the action dynamics with the states and enables long-term future prediction. We evaluate our method on video generation and action prediction tasks on the nuScenes dataset. Compared with state-of-the-art methods, our method achieves the best video consistency and the best action prediction accuracy, while also enabling high-quality long-term video and action generation.
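The rollout idea in the abstract (predict a chunk of future actions and frames jointly, append them to the history, repeat) can be illustrated with a minimal sketch. This is not the authors' implementation: the `rollout` and `toy_step` functions, the `Frame` and `Action` types, and the chunk size are all hypothetical, and `toy_step` is a trivial placeholder standing in for the learned action module and diffusion-based transition module.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[float]            # hypothetical stand-in for a video frame
Action = Tuple[float, float]   # hypothetical stand-in, e.g. (steering, speed)
Step = Callable[[List[Frame], List[Action]], Tuple[List[Action], List[Frame]]]


def rollout(
    frames: Sequence[Frame],
    actions: Sequence[Action],
    step: Step,
    num_chunks: int,
) -> Tuple[List[Frame], List[Action]]:
    """Autoregressively extend the history by repeatedly calling `step`.

    `step` is assumed to return the next chunk of actions and frames
    together, so no future actions have to be supplied by the caller.
    """
    frames, actions = list(frames), list(actions)
    for _ in range(num_chunks):
        new_actions, new_frames = step(frames, actions)
        actions.extend(new_actions)
        frames.extend(new_frames)
    return frames, actions


def toy_step(
    frames: List[Frame], actions: List[Action], chunk: int = 2
) -> Tuple[List[Action], List[Frame]]:
    """Placeholder joint predictor (NOT a learned model).

    It repeats the last observed action and shifts the last frame by the
    action's second component, only to make the loop above runnable.
    """
    last_action, last_frame = actions[-1], frames[-1]
    new_actions = [last_action] * chunk
    new_frames = []
    for action in new_actions:
        last_frame = [pixel + action[1] for pixel in last_frame]
        new_frames.append(last_frame)
    return new_actions, new_frames


if __name__ == "__main__":
    out_frames, out_actions = rollout([[0.0, 1.0]], [(0.0, 0.5)], toy_step, 3)
    print(len(out_frames), len(out_actions), out_frames[-1])
```

In this sketch each step sees the full history, including its own earlier predictions, which is what lets the loop run for an arbitrary number of chunks.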

Paper Structure

This paper contains 29 sections, 11 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Comparison of video-action generation paradigms. We compare our method with previous methods that can perform action prediction. ADriver-I jia2023adriver and DriveDreamer wang2023drivedreamer are two-stage methods: they must first predict the video and then predict the subsequent actions, and their video prediction is conditioned on the corresponding actions. In contrast, our method learns from the given action sequence and current observation and jointly predicts future actions and video in a one-stage manner.
  • Figure 2: Schematic diagram of proposed ProphetDWM. Our world model has an action module to learn the latent action and a video prediction module to learn state distribution. All modules are optimized together to jointly predict future actions and videos.
  • Figure 3: Lightweight action model. This model learns the latent action features given an action sequence and observations.
  • Figure 4: World model prediction by our proposed method. (The video resolution is 256$\times$448. Orange indicates key frames with ground-truth actions, blue indicates key frames with predicted actions, and the rest are non-key frames.)
  • Figure 5: Video prediction comparison. The models use different inference resolutions: CogVideoX-2b-nus (CogVX-nus) uses 480$\times$720, Vista uses 576$\times$1024, and our model uses 256$\times$448. Even so, our model shows better quality and clarity, avoiding a misaligned starting frame and corrupted results.
  • ...and 6 more figures