DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation
Wei Wu, Xi Guo, Weixuan Tang, Tingxuan Huang, Chiyu Wang, Dongyue Chen, Chenjing Ding
TL;DR
DriveScape introduces an end-to-end, multi-view driving video generation framework built on latent diffusion that conditions on BEV maps, 3D bounding boxes, and BEV key-frames. The core innovation is the Bi-Directional Modulated Transformer (BiMoT), which aligns diverse 3D structural inputs across views to achieve high spatial-temporal coherence without post-processing. Evaluated on nuScenes, DriveScape achieves state-of-the-art FID and FVD while generating high-resolution video under sparse conditioning, yielding practical, perception-friendly synthetic data for autonomous driving. The approach demonstrates robust control over both the dynamic foreground and the static background, and supports efficient, parallelizable inference across views.
Abstract
Recent advancements in generative models have provided promising solutions for synthesizing realistic driving videos, which are crucial for training autonomous driving perception models. However, existing approaches often struggle with multi-view video generation because it is difficult to integrate 3D information while maintaining spatial-temporal consistency and learning effectively within a single unified model. We propose DriveScape, an end-to-end framework for multi-view, 3D condition-guided video generation, capable of producing high-resolution 1024 × 576 videos at 10 Hz. Whereas other methods are limited to 2 Hz by the frame rate of the 3D box annotations, DriveScape overcomes this limit by operating under sparse conditions. Our Bi-Directional Modulated Transformer (BiMoT) ensures precise alignment of 3D structural information, maintaining spatial-temporal consistency. DriveScape excels in video generation performance, achieving state-of-the-art results on the nuScenes dataset with an FID score of 8.34 and an FVD score of 76.39. Our project homepage: https://metadrivescape.github.io/papers_project/drivescapev1/index.html
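To make the sparse-condition setting concrete, the arithmetic behind the two frame rates can be sketched as follows. This is only an illustration of which frames of a 10 Hz clip would coincide with 2 Hz annotations; the function name `keyframe_mask` and its parameters are hypothetical and are not taken from the DriveScape implementation, and the sketch assumes that the first frame is annotated and that the video rate is an integer multiple of the annotation rate.

```python
def keyframe_mask(num_frames, video_hz=10, annot_hz=2):
    """Return one boolean per frame: True where a 3D box annotation exists.

    Hypothetical helper for illustration only. Assumes frame 0 is annotated
    and that video_hz is an integer multiple of annot_hz.
    """
    if video_hz % annot_hz != 0:
        raise ValueError("video_hz must be an integer multiple of annot_hz")
    stride = video_hz // annot_hz  # 10 Hz video / 2 Hz annotations -> every 5th frame
    return [i % stride == 0 for i in range(num_frames)]


# One second of 10 Hz video: only frames 0 and 5 carry annotations.
mask = keyframe_mask(10)
annotated = [i for i, has_box in enumerate(mask) if has_box]
print(annotated)  # [0, 5]
```

Under these assumptions, four of every five frames have no 3D box condition, which is the gap that operating under sparse conditions has to cover.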
