DreamDrive: Generative 4D Scene Modeling from Street View Images

Jiageng Mao, Boyi Li, Boris Ivanovic, Yuxiao Chen, Yan Wang, Yurong You, Chaowei Xiao, Danfei Xu, Marco Pavone, Yue Wang

TL;DR

DreamDrive tackles the challenge of generating 4D driving scenes from ego trajectories by merging the generative power of video diffusion priors with geometry-aware 3D Gaussian splatting. It introduces a self-supervised hybrid Gaussian representation that separately models static backgrounds and dynamic objects, aided by spatio-temporal clustering, to produce 4D scenes that render 3D-consistent driving videos. The approach enables controllable, generalizable scene generation from in-the-wild data and improves downstream perception and planning tasks. Evaluations on nuScenes and Street View demonstrate strong 3D consistency, high-fidelity novel-view synthesis, and practical gains in perception and planning when using DreamDrive-derived data.

Abstract

Synthesizing photo-realistic visual observations from an ego vehicle's driving trajectory is a critical step towards scalable training of self-driving models. Reconstruction-based methods create 3D scenes from driving logs and synthesize geometry-consistent driving videos through neural rendering, but their dependence on costly object annotations limits their ability to generalize to in-the-wild driving scenarios. On the other hand, generative models can synthesize action-conditioned driving videos in a more generalizable way but often struggle with maintaining 3D visual consistency. In this paper, we present DreamDrive, a 4D spatial-temporal scene generation approach that combines the merits of generation and reconstruction, to synthesize generalizable 4D driving scenes and dynamic driving videos with 3D consistency. Specifically, we leverage the generative power of video diffusion models to synthesize a sequence of visual references and further elevate them to 4D with a novel hybrid Gaussian representation. Given a driving trajectory, we then render 3D-consistent driving videos via Gaussian splatting. The use of generative priors allows our method to produce high-quality 4D scenes from in-the-wild driving data, while neural rendering ensures 3D-consistent video generation from the 4D scenes. Extensive experiments on nuScenes and street view images demonstrate that DreamDrive can generate controllable and generalizable 4D driving scenes, synthesize novel views of driving videos with high fidelity and 3D consistency, decompose static and dynamic elements in a self-supervised manner, and enhance perception and planning tasks for autonomous driving.

Paper Structure (10 sections, 14 equations, 9 figures, 3 tables)


Figures (9)

  • Figure 1: An overview of DreamDrive. Given an input image, our method can generate a 4D spatio-temporal driving scene, where we can render 3D-consistent dynamic driving videos with any driving trajectories.
  • Figure 2: DreamDrive model pipeline. Given an input control image $I_{ctrl}$, our method first generates a set of reference images $I_{ref}$ using a video diffusion model $F_{VDM}$. These reference images are then lifted into 3D space via a multiview stereo network $F_{MVS}$, which provides camera information and dense 3D scene geometry to initialize 3D Gaussians. Next, we employ a self-supervised scoring network $F_{score}$ to separate the 3D Gaussians into static and dynamic components, followed by a clustering-based grouping strategy that creates hybrid Gaussian representations for modeling static structures and dynamic objects in the 4D spatio-temporal driving scene. Finally, we optimize the 4D scene using supervision from the reference images. During inference, given a driving trajectory, a novel-view driving video can be synthesized by splatting the hybrid Gaussian representations into images at each timestep.
  • Figure 3: Controllability of DreamDrive. Our method generates 3D Gaussian scenes with map and object control.
  • Figure 4: Generalization ability of DreamDrive. Given an image from anywhere in the world, our method can generate a 4D scene and render 3D-consistent driving videos from the 4D scene. This eliminates the requirement for specialized data collection and enables us to drive everywhere in the 3D world.
  • Figure 5: Novel-view driving video synthesis in DreamDrive. Our method can generate geometry-consistent driving videos with different driving trajectories.
  • ...and 4 more figures
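
The static/dynamic separation and clustering-based grouping described in the Figure 2 caption can be illustrated with a toy sketch. It is not the authors' implementation. It assumes the per-Gaussian dynamic scores (the output of $F_{score}$ in the caption) are already available as a plain array. The threshold, the radius-based connected-components clustering, and all function names are hypothetical choices made for illustration. The clustering here is purely spatial, whereas the paper describes spatio-temporal clustering.

```python
import numpy as np


def split_static_dynamic(scores, threshold=0.5):
    """Return boolean masks (static, dynamic) from per-Gaussian dynamic scores.

    The threshold is an assumed hyperparameter, not a value from the paper.
    """
    dynamic = np.asarray(scores) > threshold
    return ~dynamic, dynamic


def cluster_by_radius(points, radius=1.0):
    """Label points by connected components of the graph that links
    any two points closer than `radius` (a stand-in for the paper's clustering)."""
    n = len(points)
    labels = np.full(n, -1, dtype=int)
    current = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue  # already assigned to an earlier cluster
        labels[seed] = current
        stack = [seed]
        while stack:
            i = stack.pop()
            dists = np.linalg.norm(points - points[i], axis=1)
            neighbors = np.where((dists <= radius) & (labels == -1))[0]
            labels[neighbors] = current
            stack.extend(neighbors.tolist())
        current += 1
    return labels


def build_hybrid_groups(centers, scores, threshold=0.5, radius=1.0):
    """Split Gaussian centers into a static index set and per-object dynamic groups."""
    centers = np.asarray(centers, dtype=float)
    static_mask, dynamic_mask = split_static_dynamic(scores, threshold)
    dyn_idx = np.where(dynamic_mask)[0]
    labels = cluster_by_radius(centers[dyn_idx], radius)
    groups = {int(k): dyn_idx[labels == k] for k in np.unique(labels)}
    return np.where(static_mask)[0], groups


# Toy data: two static Gaussians and two well-separated dynamic objects.
centers = [[0, 0, 0], [10, 0, 0], [10.5, 0, 0], [20, 0, 0], [20.2, 0, 0], [0, 5, 0]]
scores = [0.1, 0.9, 0.8, 0.7, 0.95, 0.2]
static_idx, dynamic_groups = build_hybrid_groups(centers, scores)
```

In this sketch each dynamic group would share one motion model while the static Gaussians stay fixed over time, which mirrors the hybrid representation in spirit only. How the paper parameterizes the groups is not specified in this summary.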