DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation

Wei Wu, Xi Guo, Weixuan Tang, Tingxuan Huang, Chiyu Wang, Dongyue Chen, Chenjing Ding

TL;DR

DriveScape introduces an end‑to‑end, multi‑view driving video generation framework built on latent diffusion that conditions on BEV maps, 3D bounding boxes, and BEV key‑frames. The core innovation is the Bi-Directional Modulated Transformer (BiMoT), which aligns diverse 3D structural inputs across views to achieve high spatial–temporal coherence without post‑processing. Evaluated on nuScenes, DriveScape achieves state‑of‑the‑art FID and FVD with sparse conditioning and high resolution, enabling practical, perception‑friendly synthetic data for autonomous driving. The approach demonstrates robust control over dynamic foreground and static background, and supports efficient, parallelizable inference across views.

Abstract

Recent advancements in generative models have provided promising solutions for synthesizing realistic driving videos, which are crucial for training autonomous driving perception models. However, existing approaches often struggle with multi-view video generation due to the challenges of integrating 3D information while maintaining spatial-temporal consistency and learning effectively within a unified model. We propose DriveScape, an end-to-end framework for multi-view, 3D condition-guided video generation, capable of producing 1024 x 576 high-resolution videos at 10 Hz. Unlike other methods, which are limited to 2 Hz by the 3D box annotation frame rate, DriveScape overcomes this limit through its ability to operate under sparse conditions. Our Bi-Directional Modulated Transformer (BiMoT) ensures precise alignment of 3D structural information, maintaining spatial-temporal consistency. DriveScape excels in video generation performance, achieving state-of-the-art results on the nuScenes dataset with an FID score of 8.34 and an FVD score of 76.39. Our project homepage: https://metadrivescape.github.io/papers_project/drivescapev1/index.html
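To make the frame-rate gap in the abstract concrete, the following minimal sketch works out which frames of a 10 Hz clip would coincide with a 2 Hz 3D box annotation. It only illustrates the index arithmetic; it is not DriveScape's conditioning code. The function names and the assumption that the first frame is annotated and that both streams are time-aligned are hypothetical.

```python
def annotated_frame_indices(num_frames, video_hz=10, annotation_hz=2):
    """Indices of video frames that line up with an annotation.

    Assumption (not from the paper): frame 0 is annotated and the
    video rate is an integer multiple of the annotation rate.
    """
    if video_hz % annotation_hz != 0:
        raise ValueError("video_hz must be a multiple of annotation_hz")
    stride = video_hz // annotation_hz  # 10 Hz / 2 Hz -> every 5th frame
    return list(range(0, num_frames, stride))


def condition_mask(num_frames, video_hz=10, annotation_hz=2):
    """Boolean mask: True where a frame has a 3D box condition."""
    keep = set(annotated_frame_indices(num_frames, video_hz, annotation_hz))
    return [i in keep for i in range(num_frames)]


# A 2-second clip at 10 Hz has 20 frames, of which 4 carry annotations.
print(annotated_frame_indices(20))  # [0, 5, 10, 15]
print(sum(condition_mask(20)))      # 4
```

Under these assumptions only one frame in five carries a box condition, which is the sparsity the abstract says the model must operate under when generating at 10 Hz.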

Paper Structure (17 sections, 7 figures, 3 tables, 1 algorithm)

Figures (7)

  • Figure 1: End-to-end multi-view video generation pipeline. We use learnable embedding vectors to represent different cameras and categorize camera views as key views and neighbour views. Our training scheme guides the generation of key-view videos through their neighboring frames. Additionally, we introduce the key-frame condition along with a training and inference scheme to ensure multi-view consistency. Furthermore, unlike DriveDiffusion (drivediffusion), our model does not require any post-refinement process. It learns multi-view and temporal consistency simultaneously, resulting in high-fidelity street-view synthesis.
  • Figure 2: The DriveScape pipeline builds on the LDM pipeline to generate street-view videos conditioned on scene annotations, BEV maps, and 3D bounding boxes for each view. We also introduce key-frame conditions that work together with temporal and spatial self-attention modules to achieve consistency in the spatial and temporal dimensions. Our approach establishes a unified model for multi-view video generation without requiring complex post-processing or post-refinement. Furthermore, the Bi-Directional Modulated Transformer (BiMoT) module, which comprises two cross-attention layers with opposite directions and one temporal self-attention layer, enables effective alignment and synergy among the various kinds of 3D road structural information to achieve precise control over video generation.
  • Figure 3: Showcase of multi-view video generation by DriveScape. Our model achieves consistency across frames and viewpoints. Specifically, the vehicles generated by our model are accurately positioned according to the provided 3D layouts, and the depiction of the streets closely aligns with the supplied projected maps. Importantly, even when the provided bounding boxes extend beyond the visible area, our model adheres to the specified conditions and generates the remaining objects with high fidelity.
  • Figure 4: Sparse condition control. Our model generates temporally and spatially consistent videos under sparse conditions at a high resolution of 576 $\times$ 1024 and 10 fps.
  • Figure 5: Showcase of map and 3D layout control under different conditions. (a) control by the reference maps and 3D layouts; (b) change the 3D layouts to include more vehicles; (c) change the maps to include a curved pedestrian crosswalk; (d) remove all the 3D layouts.
  • ...and 2 more figures
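
The Figure 2 caption describes BiMoT as two cross-attention layers with opposite directions followed by one temporal self-attention layer. The sketch below illustrates that layer ordering in plain Python and is not the authors' implementation. It assumes the two inputs are map tokens and 3D box tokens, uses single-head attention with no learned projections or normalization, and applies residual additions; the names `attend` and `bimot_sketch` are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Single-head dot-product attention; context acts as keys and values."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)])
    return out


def add(xs, ys):
    """Element-wise residual addition of two token lists."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def bimot_sketch(map_tokens, box_tokens):
    """map_tokens[t] and box_tokens[t] are the token lists for frame t."""
    fused = []
    for m, b in zip(map_tokens, box_tokens):
        m_new = add(m, attend(m, b))  # direction 1: map tokens query box tokens
        b_new = add(b, attend(b, m))  # direction 2: box tokens query map tokens
        fused.append(m_new + b_new)   # concatenate along the token axis
    # Temporal self-attention: each token position attends across frames.
    n_tokens = len(fused[0])
    out = [[None] * n_tokens for _ in fused]
    for i in range(n_tokens):
        track = [frame[i] for frame in fused]
        mixed = add(track, attend(track, track))
        for t, token in enumerate(mixed):
            out[t][i] = token
    return out


# Toy input: 3 frames, 2 map tokens and 1 box token per frame, dimension 2.
maps = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(3)]
boxes = [[[0.5, 0.5]] for _ in range(3)]
out = bimot_sketch(maps, boxes)
print(len(out), len(out[0]), len(out[0][0]))  # 3 3 2
```

The output keeps one fused token sequence per frame, so in this toy setting the block can be stacked or followed by further layers without changing shapes.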