UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving
Rui Chen, Zehuan Wu, Yichen Liu, Yuxin Guo, Jingcheng Ni, Haifeng Xia, Siyu Xia
TL;DR
UniMLVG tackles the challenge of generating long, surround-view driving videos with precise controllability. It extends a DiT-based diffusion backbone with temporal and cross-view modules, plus explicit perspective modeling, and trains with a multi-task, multi-stage strategy across diverse datasets and conditioning modalities (text and 3D scene cues). The approach delivers substantial improvements in realism and temporal coherence (FID/FVD) and supports flexible editing, including weather changes and 3D-conditioned transformations, while handling varying numbers of viewpoints. This framework enables high-quality, controllable autonomous-driving data synthesis, with practical impact for perception and planning research and development.
Abstract
Creating diverse and realistic driving scenarios has become essential for improving the perception and planning capabilities of autonomous driving systems. However, generating long-duration driving videos that remain consistent across surround views is still a significant challenge. To address this, we present UniMLVG, a unified framework designed to generate long multi-view street videos under precise control. By integrating single- and multi-view driving videos into the training data, our approach trains a DiT-based diffusion model equipped with cross-frame and cross-view modules in three stages with multiple training objectives, substantially boosting the diversity and quality of the generated visual content. Importantly, we propose an explicit viewpoint modeling approach for multi-view video generation that effectively improves motion transition consistency. UniMLVG accepts various reference input formats (e.g., text, images, or video) and generates high-quality multi-view videos that follow the given conditions, such as 3D bounding boxes or frame-level text descriptions. Compared to the best models with similar capabilities, our framework achieves improvements of 48.2% in FID and 35.2% in FVD.
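For readers unfamiliar with how such percentages are usually derived: FID and FVD are lower-is-better metrics, so an "improvement" is commonly reported as the relative reduction against a baseline score. The abstract does not state its exact formula, so the sketch below is only one plausible reading. The function name `relative_improvement` and the example scores are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Percentage reduction of a lower-is-better metric (e.g. FID or FVD).

    Assumption: improvement = (baseline - new) / baseline * 100.
    The paper may define its reported percentages differently.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - new) / baseline


# Hypothetical scores for illustration only, NOT results from the paper.
print(relative_improvement(20.0, 10.0))  # 50.0
```

Under this reading, a 48.2% improvement in FID would mean the new score is about 0.518 times the baseline score.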
