MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes

Ruiyuan Gao, Kai Chen, Zhihao Li, Lanqing Hong, Zhenguo Li, Qiang Xu

TL;DR

MagicDrive3D proposes a two-stage framework that first learns a multi-view video generator conditioned on road maps, 3D boxes, and text, producing consistent views, and then reconstructs a 3D Gaussian Splatting scene (3DGS) with Fault-Tolerant Splatting and depth/exposure priors. The approach enables any-view rendering of controllable street scenes and reduces data collection needs by leveraging standard autonomous driving datasets like nuScenes. It demonstrates improvements in novel-view fidelity, video quality, and perception-task data augmentation (e.g., BEV segmentation), while providing applications for object-level dynamics and scene editing. The work provides a practical path toward realistic, controllable 3D street simulations with broad potential for autonomous driving and beyond.

Abstract

Controllable generative models for images and videos have seen significant success, yet 3D scene generation, especially in unbounded scenarios like autonomous driving, remains underdeveloped. Existing methods lack flexible controllability and often rely on dense view data collection in controlled environments, limiting their generalizability across common datasets (e.g., nuScenes). In this paper, we introduce MagicDrive3D, a novel framework for controllable 3D street scene generation that combines video-based view synthesis with 3D representation (3DGS) generation. It supports multi-condition control, including road maps, 3D objects, and text descriptions. Unlike previous approaches that require 3D representation before training, MagicDrive3D first trains a multi-view video generation model to synthesize diverse street views. This method utilizes routinely collected autonomous driving data, reducing data acquisition challenges and enriching 3D scene generation. In the 3DGS generation step, we introduce Fault-Tolerant Gaussian Splatting to address minor errors and use monocular depth for better initialization, alongside appearance modeling to manage exposure discrepancies across viewpoints. Experiments show that MagicDrive3D generates diverse, high-quality 3D driving scenes, supports any-view rendering, and enhances downstream tasks like BEV segmentation, demonstrating its potential for autonomous driving simulation and beyond.

Paper Structure (24 sections, 3 equations, 12 figures, 8 tables, 1 algorithm)

Figures (12)

  • Figure 1: Rendered panorama of the street scene generated from MagicDrive3D. With conditional controls from 3D bounding boxes of objects, BEV road map, ego trajectory, and text descriptions (e.g., time of day), MagicDrive3D generates complex open-world 3D scenes represented by deformable Gaussians.
  • Figure 2: Method overview of MagicDrive3D. For controllable 3D street scene generation, MagicDrive3D decomposes the task into two steps: ① conditional multi-view video generation, which handles the control signals and generates consistent view priors for the novel scene; and ② Gaussian Splatting generation with our Enhanced GS pipeline, which supports rendering from various viewpoints (e.g., panorama).
  • Figure 3: Illustration of the local inconsistency between two successive generated frames of the Front-Left (FL) camera. Even though our video generation model retains fine 3D consistency, minor discrepancies are inevitable. Our FTGS can effectively reconstruct the scene with awareness of such discrepancies.
  • Figure 4: We optimize the monocular depths (a) in two steps for better alignment: coarse scale/offset estimation with the SfM point cloud (PCD) (b), followed by GS optimization (c).
  • Figure 5: Qualitative comparison with NF-LDM (Kim et al., 2023). Our method can generate higher quality 3D scenes while maintaining better object geometry, with stronger controllability compared to NF-LDM (see the significant object deformation in NF-LDM). Panoramas for GS are transformed and stitched from perspective cameras with $90^\circ$ FOV. Views in the last row are rendered with unseen camera rigs of nuScenes.
  • ...and 7 more figures
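
The coarse scale/offset estimation named in the Figure 4 caption can be illustrated with a small sketch. It is not the authors' implementation. It assumes the alignment is an ordinary least-squares fit of `scale * mono + offset` to the depths of SfM points, sampled at the pixels where those points project. The function names `fit_scale_offset` and `align_depth` are hypothetical, and the fit is done directly in depth space, whereas a real monocular estimator may output inverse or relative depth.

```python
from typing import List, Sequence, Tuple


def fit_scale_offset(mono: Sequence[float], sfm: Sequence[float]) -> Tuple[float, float]:
    """Least-squares (scale, offset) such that scale * mono[i] + offset ~ sfm[i].

    mono: monocular depth values sampled at pixels hit by SfM points (assumed input).
    sfm:  depths of the same SfM points in the camera frame (assumed input).
    """
    n = len(mono)
    if n != len(sfm):
        raise ValueError("mono and sfm must have the same length")
    if n < 2:
        raise ValueError("need at least two paired samples")

    mean_x = sum(mono) / n
    mean_y = sum(sfm) / n
    var_x = sum((x - mean_x) ** 2 for x in mono)
    if var_x == 0.0:
        raise ValueError("monocular depths are constant; scale is undetermined")

    # Closed-form solution of 1D linear regression.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(mono, sfm))
    scale = cov_xy / var_x
    offset = mean_y - scale * mean_x
    return scale, offset


def align_depth(mono: Sequence[float], scale: float, offset: float) -> List[float]:
    """Apply the fitted affine correction to every monocular depth value."""
    return [scale * d + offset for d in mono]


if __name__ == "__main__":
    # Synthetic check: the sparse depths are an exact affine map of the mono depths.
    mono_samples = [0.1, 0.2, 0.4, 0.8]
    sfm_samples = [2.0 * d + 1.0 for d in mono_samples]
    s, t = fit_scale_offset(mono_samples, sfm_samples)
    print(s, t)
```

In this reading, the fit only gives a coarse per-view correction. The caption's second step, GS optimization, would then refine the depths further, and that step is not sketched here.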