DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation

Guosheng Zhao, Chaojun Ni, Xiaofeng Wang, Zheng Zhu, Xueyang Zhang, Yida Wang, Guan Huang, Xinze Chen, Boyuan Wang, Youyi Zhang, Wenjun Mei, Xingang Wang

TL;DR

DriveDreamer4D addresses a key limitation of current 4D driving scene reconstruction, namely poor rendering away from the recorded trajectory, by using a driving world model to synthesize videos along diverse novel trajectories under structured constraints. It introduces the Novel Trajectory Generation Module (NTGM), which automates the generation of complex-maneuver data, and the Cousin Data Training Strategy (CDTS), which merges real and synthetic data when optimizing 4D Gaussian Splatting. Results on Waymo scenes show gains in FID and in spatiotemporal coherence (NTA-IoU/NTL-IoU) across multiple baselines, and a user study favors the method. The approach targets closed-loop driving simulation with realistic motion dynamics and consistent scene structure.

Abstract

Closed-loop simulation is essential for advancing end-to-end autonomous driving systems. Contemporary sensor simulation methods, such as NeRF and 3DGS, rely predominantly on conditions closely aligned with training data distributions, which are largely confined to forward-driving scenarios. Consequently, these methods face limitations when rendering complex maneuvers (e.g., lane change, acceleration, deceleration). Recent advancements in autonomous-driving world models have demonstrated the potential to generate diverse driving videos. However, these approaches remain constrained to 2D video generation and inherently lack the spatiotemporal coherence required to capture the intricacies of dynamic driving environments. In this paper, we introduce DriveDreamer4D, which enhances 4D driving scene representation by leveraging world model priors. Specifically, we utilize the world model as a data machine to synthesize novel trajectory videos, where structured conditions are explicitly leveraged to control the spatiotemporal consistency of traffic elements. In addition, the cousin data training strategy is proposed to facilitate merging real and synthetic data for optimizing 4DGS. To our knowledge, DriveDreamer4D is the first to utilize video generation models for improving 4D reconstruction in driving scenarios. Experimental results reveal that DriveDreamer4D significantly enhances generation quality under novel trajectory views, achieving relative FID improvements of 32.1%, 46.4%, and 16.3% over PVG, S3Gaussian, and Deformable-GS, respectively. Moreover, DriveDreamer4D markedly enhances the spatiotemporal coherence of driving agents, as verified by a comprehensive user study and by relative increases of 22.6%, 43.5%, and 15.6% in the NTA-IoU metric.
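The relative improvements quoted in the abstract mix a lower-is-better metric (FID) with a higher-is-better metric (NTA-IoU). The sketch below shows one common way such percentages are computed. It assumes the usual convention of normalizing the change by the baseline value, which the abstract does not spell out. The function name `relative_improvement` and all numeric inputs are hypothetical illustrations, not values from the paper.

```python
def relative_improvement(baseline: float, ours: float, lower_is_better: bool) -> float:
    """Return the improvement of `ours` over `baseline` as a fraction of `baseline`.

    Assumed convention (not stated in the abstract): the change is
    normalized by the baseline score, and the sign is chosen so that a
    positive result always means "ours is better".
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    if lower_is_better:
        # e.g. FID: a drop from baseline to ours counts as improvement
        return (baseline - ours) / baseline
    # e.g. NTA-IoU: a rise from baseline to ours counts as improvement
    return (ours - baseline) / baseline


# Hypothetical scores, chosen only to illustrate the arithmetic.
fid_gain = relative_improvement(baseline=100.0, ours=67.9, lower_is_better=True)
iou_gain = relative_improvement(baseline=0.5, ours=0.6, lower_is_better=False)
print(f"FID: {fid_gain:.1%}, IoU: {iou_gain:.1%}")
```

Under this convention, a relative FID improvement of 32.1% means the FID dropped to 67.9% of the baseline value, while a relative NTA-IoU increase of 22.6% means the score rose to 1.226 times the baseline.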

Paper Structure

This paper contains 20 sections, 13 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: Previous 4D Gaussian Splatting methods (e.g., PVG, $\text{S}^3$Gaussian, Deformable-GS) face challenges in rendering novel trajectories, such as lane change. DriveDreamer4D addresses this by enhancing 4D driving scene representation via integrating priors from world models, significantly improving rendering quality under complex scenarios and novel trajectory viewpoints.
  • Figure 2: The overall framework of DriveDreamer4D. First, new trajectories are obtained by altering the actions of the original trajectory (e.g., steering angle, speed). Novel trajectory videos are then generated, conditioned on the first frame and on the structured information (3D bounding boxes, HDMap) from the new trajectory. Finally, the temporally aligned cousin pair (original and novel trajectory videos) is merged to optimize the 4D Gaussian Splatting model, with a regularization loss computed to ensure perceptual coherence.
  • Figure 3: Qualitative comparisons of novel trajectory renderings during lane change scenarios. The orange boxes highlight that DriveDreamer4D significantly enhances the rendering quality across various baselines (PVG, $\text{S}^3$Gaussian, Deformable-GS).
  • Figure 4: Visual comparisons in the novel trajectories for the Cousin Data Training Strategy (CDTS) ablation study. The orange boxes emphasize the superior performance of DriveDreamer4D and the further improvements in detail rendering brought by CDTS.
  • Figure 5: Qualitative comparisons of novel trajectory renderings during speed change scenarios. The orange boxes highlight that DriveDreamer4D significantly enhances the rendering quality across various baseline methods (PVG, $\text{S}^3$Gaussian, Deformable-GS).
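
The Figure 2 caption describes obtaining new trajectories by altering the actions (e.g., steering angle, speed) of the original trajectory. The sketch below illustrates that idea under loud assumptions: it uses a simple unicycle motion model with per-step `(speed, yaw_rate)` actions, and the helper names `rollout`, `scale_speed`, and `add_lane_change_pulse` are hypothetical. It is not the paper's NTGM implementation, which may use a different vehicle model and additional validity checks.

```python
import math

def rollout(actions, dt, start=(0.0, 0.0, 0.0)):
    """Integrate (speed, yaw_rate) actions into (x, y, yaw) poses.

    Assumption: a unicycle model with forward-Euler steps of length dt.
    """
    x, y, yaw = start
    poses = [(x, y, yaw)]
    for speed, yaw_rate in actions:
        x += speed * math.cos(yaw) * dt
        y += speed * math.sin(yaw) * dt
        yaw += yaw_rate * dt
        poses.append((x, y, yaw))
    return poses

def scale_speed(actions, factor):
    """Speed-change maneuver: multiply every speed by `factor`."""
    return [(speed * factor, yaw_rate) for speed, yaw_rate in actions]

def add_lane_change_pulse(actions, start_idx, n_steps, yaw_rate_delta):
    """Lane-change maneuver: steer one way for n_steps, then back for n_steps.

    The two opposite pulses cancel, so the final heading matches the
    original while the lateral position is shifted.
    """
    if start_idx < 0 or start_idx + 2 * n_steps > len(actions):
        raise ValueError("pulse does not fit inside the action sequence")
    altered = list(actions)
    for i in range(start_idx, start_idx + n_steps):
        speed, yaw_rate = altered[i]
        altered[i] = (speed, yaw_rate + yaw_rate_delta)
    for i in range(start_idx + n_steps, start_idx + 2 * n_steps):
        speed, yaw_rate = altered[i]
        altered[i] = (speed, yaw_rate - yaw_rate_delta)
    return altered

# Hypothetical original trajectory: 2 s of straight driving at 10 m/s.
dt = 0.1
original_actions = [(10.0, 0.0)] * 20

original = rollout(original_actions, dt)
slower = rollout(scale_speed(original_actions, 0.5), dt)
lane_change = rollout(add_lane_change_pulse(original_actions, 5, 5, 0.2), dt)
```

In the pipeline described by the caption, each altered trajectory would then supply the structured conditions (3D bounding boxes and HDMap as seen from the new ego poses) for video generation. That projection step is omitted here.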