
DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving

Yiyao Zhu, Ying Xue, Haiming Zhang, Guangfeng Jiang, Wending Zhou, Xu Yan, Jiantao Gao, Yingjie Cai, Bingbing Liu, Zhen Li, Shaojie Shen

Abstract

Vision-based autonomous driving has attracted much attention due to its low cost and strong performance. Compared with dense BEV (Bird's Eye View) or sparse query models, the Gaussian-centric method offers a comprehensive yet sparse representation that describes the scene with 3D semantic Gaussians. In this paper, we introduce DLWM, a novel paradigm with Dual Latent World Models specifically designed to enable holistic Gaussian-centric pre-training in autonomous driving in two stages. In the first stage, DLWM predicts 3D Gaussians from queries through self-supervised reconstruction of multi-view semantic and depth images. Equipped with these fine-grained contextual features, in the second stage two latent world models are trained separately for temporal feature learning: Gaussian-flow-guided latent prediction for downstream occupancy perception and forecasting tasks, and ego-planning-guided latent prediction for motion planning. Extensive experiments on the SurroundOcc and nuScenes benchmarks demonstrate that DLWM yields significant performance gains across Gaussian-centric 3D occupancy perception, 4D occupancy forecasting, and motion planning tasks.
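
To make the two-stage recipe concrete, here is a minimal sketch of the training flow the abstract describes. All module names and call signatures below (perception, renderer, flow_wm, plan_wm) are illustrative assumptions, not the paper's published API; the sketch only mirrors the stated supervision: Stage 1 renders depth/semantic maps from predicted Gaussians, and Stage 2 regresses each world model's predicted latent onto features perceived from the next frame by a frozen Stage-1 module.

```python
import torch
import torch.nn.functional as F

def stage1_step(perception, renderer, imgs, depth_gt, sem_gt, opt):
    """Stage 1 (sketch): predict 3D semantic Gaussians from multi-view images,
    then supervise them by reconstructing depth and semantic maps."""
    gaussians, _ = perception(imgs)               # 3D semantic Gaussians + context latent
    depth_pred, sem_logits = renderer(gaussians)  # differentiable rendering to 2D maps
    loss = F.l1_loss(depth_pred, depth_gt) + F.cross_entropy(sem_logits, sem_gt)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def stage2_step(frozen_perception, flow_wm, plan_wm, imgs_t, imgs_t1, ego_traj, opt):
    """Stage 2 (sketch): latent world models supervised by features perceived
    from the next frame. The paper trains the two models separately; they are
    combined in one step here only for brevity."""
    with torch.no_grad():                          # frozen Stage-1 teacher
        gaussians_t, latent_t = frozen_perception(imgs_t)
        _, latent_t1 = frozen_perception(imgs_t1)  # prediction target
    # (a) Gaussian-flow-guided: propagate current Gaussians to the future frame.
    latent_t1_flow = flow_wm(gaussians_t, latent_t)
    # (b) Ego-planning-guided: condition forecasting on the planned trajectory.
    latent_t1_plan = plan_wm(latent_t, ego_traj)
    loss = F.mse_loss(latent_t1_flow, latent_t1) + F.mse_loss(latent_t1_plan, latent_t1)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```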

Figures (6)

  • Figure 1: Illustration of our DLWM pre-training and the resulting performance improvements on downstream tasks.
  • Figure 2: Overall pipeline of DLWM. Stage 1 (Sec. 3.1) focuses on learning robust 3D Gaussian scene representations from multi-view videos via self-supervised reconstruction of depth and semantic maps. Stage 2 introduces dual latent world models. a. The Gaussian-flow-guided model (Sec. 3.2) explicitly predicts 3D Gaussian flow, propagating the current Gaussian states to the future frame for latent prediction. b. The ego-planning-guided model (Sec. 3.3) conditions future scene forecasting on the predicted ego trajectory. All predicted latents are supervised by the features perceived from the next multi-view images by a frozen Gaussian perception module.
  • Figure 3: 4D occupancy forecasting in a streaming manner. We transform the current 3D Gaussians to the next frame with ego-motion alignment and complete newly revealed areas with random Gaussians (a minimal sketch of this step follows the figure list).
  • Figure 4: Qualitative results of depth and semantic reconstruction and forecasting during pre-training. The left column shows depth and semantic reconstruction at the current frame, while the right column shows depth and semantic images rendered from the predicted Gaussians at the next frame, compared with the GT RGB images.
  • Figure 5: Qualitative comparison of 3D occupancy perception among the GT, the baseline without pre-training, and ours with pre-training.
  • ...and 1 more figure
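
For the streaming forecast in Figure 3, the key operation is warping the current-frame Gaussians into the next ego frame and padding newly exposed space with random Gaussians. Below is a minimal sketch, assuming Gaussian means in the ego frame, orientations stored as 3x3 rotation matrices, and an axis-aligned scene range; these layout choices are illustrative assumptions, not the paper's interface.

```python
import torch

def align_gaussians_to_next_frame(means, rotations, T_next_from_cur):
    """Apply the SE(3) ego motion T (4x4, next frame <- current frame) to
    Gaussian means; orientations compose with the rotation part of T."""
    R, t = T_next_from_cur[:3, :3], T_next_from_cur[:3, 3]
    means_next = means @ R.T + t       # (N, 3) points transformed into next frame
    rotations_next = R @ rotations     # (N, 3, 3) batched rotation composition
    return means_next, rotations_next

def complete_with_random_gaussians(means, scene_range, n_new):
    """Seed the area newly entering the scene range with random Gaussians so
    the next step still has full spatial coverage."""
    lo, hi = scene_range               # each a (3,) tensor of range bounds
    new_means = lo + (hi - lo) * torch.rand(n_new, 3)
    return torch.cat([means, new_means], dim=0)
```

In use, one would warp the Gaussians with the ego pose between frames, then pad the set before running the frozen perception module on the next frame, matching the stream-style rollout the caption describes.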