HoloDrive: Holistic 2D-3D Multi-Modal Street Scene Generation for Autonomous Driving

Zehuan Wu, Jingcheng Ni, Xiaodong Wang, Yuxin Guo, Rui Chen, Lewei Lu, Jifeng Dai, Yuwen Xiong

TL;DR

This work tackles the need for coherent generation of street scenes across 2D camera views and 3D LiDAR data for autonomous driving. It develops HoloDrive, a framework that jointly generates 2D multi-view images and 3D LiDAR point clouds by coupling a diffusion-based image generator with a VQ-VAE–based LiDAR tokenizer. The two are interconnected via BEV↔Camera transforms and depth supervision, and the framework is extended to temporal video generation. The approach achieves state-of-the-art or competitive results on nuScenes for both single-frame and video generation, and ablation studies validate the benefits of cross-modal interactions and progressive training. The proposed system enables more realistic neural simulators and end-to-end multi-modal scene synthesis conditioned on scene layouts and descriptions, with practical implications for training and evaluating autonomous agents.

Abstract

Generative models have significantly improved generation and prediction quality for either camera images or LiDAR point clouds in autonomous driving. However, a real-world autonomous driving system uses multiple input modalities, usually cameras and LiDARs, which carry complementary information for generation. Existing generation methods ignore this crucial feature, so their results cover only separate 2D or 3D information. To fill this gap in 2D-3D multi-modal joint generation for autonomous driving, we propose HoloDrive, a framework that jointly generates camera images and LiDAR point clouds. We employ BEV-to-Camera and Camera-to-BEV transform modules between the heterogeneous generative models, and introduce a depth prediction branch in the 2D generative model to disambiguate the un-projection from image space to BEV space. We then extend the method to predict the future by adding temporal structure and carefully designed progressive training. Finally, we conduct experiments on single-frame generation and world model benchmarks, and demonstrate that our method leads to significant performance gains over SOTA methods in terms of generation metrics.
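
The abstract notes that a depth prediction helps disambiguate the un-projection from image space to BEV space. The sketch below illustrates why, using a generic pinhole camera model rather than the paper's actual modules: a single pixel column defines a ray, and only a depth value picks out one BEV cell along it. The function name, intrinsics (`fx`, `cx`), grid extent, and cell size are all hypothetical values chosen for illustration.

```python
def unproject_column_to_bev(u, depth, fx, cx, bev_range=50.0, cell_size=0.5):
    """Map pixel column u at a given depth to a (row, col) BEV cell.

    Assumes a pinhole camera with x pointing right and z pointing forward.
    The pixel row only affects height, which a BEV grid discards, so it is
    omitted here. Returns None when the point falls outside the grid.
    All parameters are illustrative assumptions, not values from the paper.
    """
    # Back-project: lateral offset grows linearly with depth along the ray.
    x = (u - cx) / fx * depth
    z = depth
    if abs(x) >= bev_range or abs(z) >= bev_range:
        return None
    # Shift so the grid origin is at (-bev_range, -bev_range), then bin.
    col = int((x + bev_range) // cell_size)
    row = int((z + bev_range) // cell_size)
    return row, col


# The same pixel column maps to different BEV cells at different depths.
near_cell = unproject_column_to_bev(1030, 10.0, fx=1000.0, cx=800.0)
far_cell = unproject_column_to_bev(1030, 40.0, fx=1000.0, cx=800.0)
out_of_grid = unproject_column_to_bev(1030, 60.0, fx=1000.0, cx=800.0)
```

Without a depth estimate, every cell along the ray is an equally valid target for the pixel's features, which is the ambiguity a depth branch is meant to resolve.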

Paper Structure

This paper contains 20 sections, 4 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Our pipeline HoloDrive can jointly generate realistic street scene video of surround-view cameras and LiDAR point cloud.
  • Figure 2: Overview of the proposed pipeline. (a) The conditions used by our pipeline. (b) The overall joint training and inference pipeline. (c) The structure to convert BEV features for the image generation model. (d) The structure to convert image features for the LiDAR generation model.
  • Figure 3: One visual result of our joint 2D-3D generation. As indicated by the colored boxes and lines, our generation results exhibit high consistency across the two modalities.
  • Figure 4: The qualitative comparisons to the baseline method of the image generation.
  • Figure 5: The qualitative comparisons to the baseline method of the LiDAR generation.
  • ...and 6 more figures