UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving

Rui Chen, Zehuan Wu, Yichen Liu, Yuxin Guo, Jingcheng Ni, Haifeng Xia, Siyu Xia

TL;DR

UniMLVG tackles the challenge of generating long, surround-view driving videos with precise controllability. It extends a DiT-based diffusion backbone with temporal and cross-view modules, plus explicit perspective modeling, and trains with a multi-task, multi-stage strategy across diverse datasets and conditioning modalities (text and 3D scene cues). The approach delivers substantial improvements in realism and temporal coherence (FID/FVD) and supports flexible editing, including weather changes and 3D-conditioned transformations, while handling varying numbers of viewpoints. This framework enables high-quality, controllable autonomous-driving data synthesis, with practical impact for perception and planning research and development.

Abstract

The creation of diverse and realistic driving scenarios has become essential for enhancing the perception and planning capabilities of autonomous driving systems. However, generating long-duration, surround-view-consistent driving videos remains a significant challenge. To address this, we present UniMLVG, a unified framework designed to generate extended multi-perspective street videos under precise control. By integrating single- and multi-view driving videos into the training data, our approach updates a DiT-based diffusion model equipped with cross-frame and cross-view modules across three stages with multiple training objectives, substantially boosting the diversity and quality of generated visual content. Importantly, we propose an explicit viewpoint modeling approach for multi-view video generation that improves motion transition consistency. Capable of handling various input reference formats (e.g., text, images, or video), UniMLVG generates high-quality multi-view videos according to the corresponding condition constraints, such as 3D bounding boxes or frame-level text descriptions. Compared to the best models with similar capabilities, our framework achieves improvements of 48.2% in FID and 35.2% in FVD.
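
The percentage improvements quoted above are most naturally read as relative reductions, since lower is better for both FID and FVD. That reading is an assumption here, as the abstract does not state the formula. A minimal sketch of the arithmetic follows; the function name `relative_reduction` and the scores are hypothetical and chosen only for illustration, not taken from the paper.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline`.

    Intended for lower-is-better metrics such as FID or FVD.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - ours) / baseline


# Hypothetical scores, used only to illustrate the arithmetic.
baseline_score = 10.0
our_score = 5.0
print(f"{relative_reduction(baseline_score, our_score):.1f}% lower")
```

Under this reading, a 48.2% improvement in FID would mean the reported score is a little over half that of the strongest comparable baseline.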

Paper Structure

This paper contains 18 sections, 4 equations, 20 figures, 5 tables.

Figures (20)

  • Figure 1: Four tasks our model can perform: (a) generating a $20$s multi-view video based on reference frames; (b) generating a $20$s multi-view video without any reference frames; (c) creating a realistic surround-view video from conditions obtained in a simulated environment; (d) altering weather conditions from sunny to snowy, driven by text-based prompts.
  • Figure 2: Overall framework of the model. Left: The encoded reference frames are concatenated with the noisy latent to form the video latent, which is fed into $N$ UniMLVG blocks. The diverse conditions, including image-level descriptions, camera pose, and 3D conditions, are injected into each UniMLVG block and interact with the video latent to guide the generated content. Finally, the model outputs the subsequent frames, which can then be used as the reference frames for the next autoregressive generation step. Note that our model can also produce driving video from those conditions alone, without any reference frames. Right: Details of the UniMLVG block. A UniMLVG block comprises three distinct sub-blocks that perform attention across different dimensions, and the different conditions are integrated into the video latent at different positions during the forward pass.
  • Figure 3: Field of view overlap between cameras over time.
  • Figure 4: Text-based weather editing at different times of day: (a) shows text-based control changing sunny to rainy. (b) demonstrates text editing to generate a snowy night scenario. In each subfigure, the left side shows the ground truth, while the right side presents the generated results, with the top and bottom representing the front and rear viewpoints.
  • Figure 5: Examples of scene generation diversity under various weather conditions. (a) Under sunny conditions, the appearance and number of houses, cloud positions, and sunlight direction differ from the ground truth (GT). (b) Under cloudy conditions, the appearance of houses and the colors of nearby vehicles differ from GT. (c) Under rainy conditions, both the appearance of houses and vehicles deviate from GT. The top row displays the ground truth.
  • ...and 15 more figures
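
The autoregressive loop described in the Figure 2 caption, where generated frames become the reference frames for the next step and reference frames are optional, can be sketched as below. This is an illustrative outline only, not the authors' implementation: `rollout`, `dummy_generate_chunk`, the chunk length, and the string stand-ins for frames and conditions are all hypothetical placeholders for a real diffusion sampler and real tensors.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical stand-ins: a real system would use image tensors and
# structured conditions (text, camera pose, 3D boxes).
Frame = str
Condition = str


def rollout(
    generate_chunk: Callable[[List[Frame], Sequence[Condition]], List[Frame]],
    conditions: Sequence[Condition],
    chunk_len: int,
    num_ref: int,
    initial_refs: Optional[List[Frame]] = None,
) -> List[Frame]:
    """Generate a long video chunk by chunk.

    The last `num_ref` frames of each chunk are fed back as reference
    frames for the next chunk. With no initial references, the first
    chunk is generated from the conditions alone.
    """
    refs: List[Frame] = list(initial_refs or [])
    video: List[Frame] = []
    for start in range(0, len(conditions), chunk_len):
        chunk_conditions = conditions[start:start + chunk_len]
        new_frames = generate_chunk(refs, chunk_conditions)
        video.extend(new_frames)
        # Guard num_ref == 0, because new_frames[-0:] would keep every frame.
        refs = new_frames[-num_ref:] if num_ref > 0 else []
    return video


def dummy_generate_chunk(
    refs: List[Frame], conditions: Sequence[Condition]
) -> List[Frame]:
    """Placeholder generator: one labeled 'frame' per condition."""
    return [f"{c}|refs={len(refs)}" for c in conditions]


frames = rollout(
    dummy_generate_chunk,
    conditions=["c0", "c1", "c2", "c3", "c4"],
    chunk_len=2,
    num_ref=1,
)
print(frames)
```

In the dummy run, the first chunk sees zero reference frames and every later chunk sees one, mirroring the reference-free start followed by reference-conditioned continuation.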