DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving

Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, Liping Jing, Yiming Nie, Bin Dai

TL;DR

DriveWorld tackles 4D scene understanding for vision-centric autonomous driving by learning a spatio-temporal BEV representation from multi-camera video using world models. It formulates latent dynamics with two variables, a history variable $h_t$ and a stochastic state $s_t$, which a Dynamic Memory Bank and a Static Scene Propagation module process to predict occupancy at the current step and over a future horizon $L$, given a past horizon $T$. A Task Prompt, generated from a text encoder, conditions feature extraction for downstream tasks, improving task-specific representations. Pre-training on nuScenes and OpenScene yields consistent gains across 3D object detection, online mapping, multi-object tracking, motion forecasting, occupancy prediction, and planning, with additional boosts from OpenScene data. The results suggest DriveWorld provides a scalable, unified 4D representation with reduced data requirements for downstream autonomy tasks.
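
To make the roles of the two latent variables concrete, here is a minimal sketch of one latent-dynamics step in the general style of recurrent state-space world models. It is not the paper's Memory State-Space Model: the scalar states, the parameter names (`w_h`, `w_s`, `w_a`, `w_mu`, `w_sigma`), the tanh update, and the Gaussian prior with a softplus standard deviation are all assumptions chosen for illustration. It also omits the posterior that would condition on camera observations.

```python
import math
import random


def latent_step(h_prev, s_prev, a_prev, params, rng):
    """One step of a toy latent-dynamics model with scalar states.

    h is the deterministic history, s is the stochastic state, a is the action.
    All parameter names here are hypothetical, not taken from the paper.
    """
    # Deterministic history update from the previous history, state and action.
    h_t = math.tanh(
        params["w_h"] * h_prev + params["w_s"] * s_prev + params["w_a"] * a_prev
    )
    # Gaussian prior over the stochastic state, predicted from the history.
    mean = params["w_mu"] * h_t
    std = math.log1p(math.exp(params["w_sigma"] * h_t)) + 1e-3  # softplus, kept > 0
    s_t = rng.gauss(mean, std)
    return h_t, s_t


def rollout(h0, s0, actions, params, seed=0):
    """Roll the prior forward over a sequence of actions (an imagined future)."""
    rng = random.Random(seed)
    h, s = h0, s0
    trajectory = []
    for a in actions:
        h, s = latent_step(h, s, a, params, rng)
        trajectory.append((h, s))
    return trajectory


params = {"w_h": 0.9, "w_s": 0.5, "w_a": 0.3, "w_mu": 1.0, "w_sigma": 0.1}
future = rollout(0.0, 0.0, actions=[0.1, 0.2, 0.3], params=params, seed=1)
```

In this sketch the length of `actions` plays the role of the future horizon $L$; a decoder (not shown) would map each `(h, s)` pair to a predicted occupancy.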

Abstract

Vision-centric autonomous driving has recently attracted wide attention due to its lower cost. Pre-training is essential for extracting a universal representation. However, current vision-centric pre-training typically relies on either 2D or 3D pre-text tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task. In this paper, we address this challenge by introducing a world model-based autonomous driving 4D representation learning framework, dubbed DriveWorld, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion. Specifically, we propose a Memory State-Space Model for spatio-temporal modelling, which consists of a Dynamic Memory Bank module for learning temporal-aware latent dynamics to predict future changes and a Static Scene Propagation module for learning spatial-aware latent statics to offer comprehensive scene contexts. We additionally introduce a Task Prompt to decouple task-aware features for various downstream tasks. The experiments demonstrate that DriveWorld delivers promising results on various autonomous driving tasks. When pre-trained with the OpenScene dataset, DriveWorld achieves a 7.5% increase in mAP for 3D object detection, a 3.0% increase in IoU for online mapping, a 5.0% increase in AMOTA for multi-object tracking, a 0.1 m decrease in minADE for motion forecasting, a 3.0% increase in IoU for occupancy prediction, and a 0.34 m reduction in average L2 error for planning.

Paper Structure (39 sections, 17 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Comparison of different pre-training methods for vision-centric autonomous driving. (a) Monocular 2D pre-training with 2D pre-text tasks (e.g., 2D classification and depth estimation). (b) Multi-camera 3D pre-training via 3D scene reconstruction or 3D object detection. (c) The proposed 4D pre-training based on world models learns unified spatio-temporal representations.
  • Figure 2: Overall framework of the proposed DriveWorld. Since autonomous driving heavily relies on the understanding of 4D scenes, our approach first involves the transformation of multi-camera images into a 4D space. Within the proposed Memory State-Space Model for spatio-temporal modelling, we have two essential components: the Dynamic Memory Bank, which learns temporal-aware latent dynamics for predicting future states, and the Static Scene Propagation, which learns spatial-aware latent statics to provide comprehensive scene context. This configuration facilitates the decoder's task of reconstructing 3D occupancy and actions for both the current and future time steps. Besides, we design the Task Prompt based on a pre-trained text encoder to adaptively decouple task-aware features for various tasks.
  • Figure 3: Overall architecture of the proposed Memory State-Space Model (MSSM). MSSM divides the transmitted information into two categories: temporal-aware information and spatial-aware information. The Dynamic Memory Bank module utilizes motion-aware layer normalization (MLN) to encode temporal-aware attributes and engages in information interaction with the dynamically updated memory bank. Meanwhile, the Static Scene Propagation module employs BEV features to represent spatial-aware latent statics, which are directly conveyed to the decoder.
  • Figure 4: Visual comparison between UniAD (top) and our DriveWorld (bottom).
  • Figure 5: Visualization of BEV feature maps when prompting with different tasks.
  • ...and 2 more figures
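
The Figure 3 caption mentions motion-aware layer normalization (MLN) without giving its form. One plausible reading, sketched below, is a conditional layer norm whose scale and shift are produced from a motion vector by linear maps. This form, the function names, and the weight arguments (`W_gamma`, `b_gamma`, `W_beta`, `b_beta`) are assumptions for illustration, not the paper's definition.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(W, b, v):
    """Compute W @ v + b for a list-of-rows matrix W."""
    return [sum(w * vi for w, vi in zip(row, v)) + bi for row, bi in zip(W, b)]


def motion_aware_layer_norm(x, motion, W_gamma, b_gamma, W_beta, b_beta):
    """Layer norm whose affine parameters depend on a motion vector (assumed form)."""
    gamma = linear(W_gamma, b_gamma, motion)  # per-channel scale from motion
    beta = linear(W_beta, b_beta, motion)     # per-channel shift from motion
    x_hat = layer_norm(x)
    return [g * xh + b for g, xh, b in zip(gamma, x_hat, beta)]


# Hypothetical example: 4 feature channels, 2-dimensional motion descriptor.
features = [1.0, 2.0, 3.0, 4.0]
motion = [0.5, -0.2]
zeros = [[0.0, 0.0] for _ in features]
out = motion_aware_layer_norm(
    features, motion, zeros, [1.0] * 4, zeros, [0.0] * 4
)
```

With zero weight matrices, unit scale bias, and zero shift bias, as in the usage above, the motion vector has no effect and the function reduces to a plain layer norm, which makes a convenient sanity check.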