MoVieDrive: Urban Scene Synthesis with Multi-Modal Multi-View Video Diffusion Transformer

Guile Wu, David Huang, Dongfeng Bai, Bingbing Liu

Abstract

Urban scene synthesis with video generation models has recently shown great potential for autonomous driving. Existing video generation approaches for autonomous driving focus primarily on RGB video and do not support multi-modal video generation. However, multi-modal data, such as depth maps and semantic maps, are crucial for holistic urban scene understanding in autonomous driving. Although it is feasible to use multiple models to generate different modalities, doing so makes model deployment more difficult and does not leverage complementary cues across modalities. To address this problem, we propose a novel multi-modal multi-view video generation approach for autonomous driving. Specifically, we construct a unified diffusion transformer model composed of modal-shared components and modal-specific components. We then leverage diverse conditioning inputs to encode controllable scene structure and content cues into the unified multi-modal multi-view diffusion model. In this way, our approach can generate multi-modal multi-view driving scene videos in a unified framework. Our experiments on a real-world autonomous driving dataset show that our approach achieves compelling video generation quality and controllability compared with state-of-the-art methods, while supporting multi-modal multi-view data generation.

Paper Structure

This paper contains 45 sections, 6 equations, 19 figures, 6 tables.

Figures (19)

  • Figure 1: An illustration of the proposed MoVieDrive approach to urban scene video generation in autonomous driving. Our approach can be used to generate multi-modal multi-view driving scene videos, to synthesize driving scenes under diverse time and weather conditions, and to generate long videos without reference frames.
  • Figure 2: An overview of the proposed MoVieDrive approach. Our approach employs diverse conditioning inputs and a multi-modal multi-view diffusion transformer model to facilitate urban scene understanding in autonomous driving.
  • Figure 3: An illustration of our diffusion transformer block.
  • Figure 4: Qualitative comparison on nuScenes. From left to right, we show results of the back-right, back, back-left, front-left, front, and front-right cameras and highlight some noticeable details. Please check the supplementary material for an enlarged version.
  • Figure 5: Visualization of cross-modal consistency on nuScenes.
  • ...and 14 more figures