Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention

Hannan Lu, Xiaohe Wu, Shudong Wang, Xiameng Qin, Xinyu Zhang, Junyu Han, Wangmeng Zuo, Ji Tao

TL;DR

This work tackles the generation of consistent multi-view driving videos conditioned on BEV layouts, addressing both cross-view and cross-frame coherence. It introduces CogDriving, a Diffusion Transformer with holistic-4D attention that jointly models spatial, temporal, and viewpoint dependencies, supported by a lightweight Micro-Controller for conditioning and a re-weighted loss that emphasizes small, critical objects. On the nuScenes validation set, CogDriving achieves an FVD of 37.8 (lower is better), and its generated videos serve as data augmentation that improves downstream BEV segmentation and 3D object detection. The Micro-Controller maintains controllability with only 1.1% of the parameters of the standard ControlNet. The approach demonstrates practical utility for autonomous driving perception systems and suggests further potential with richer conditioning signals and broader scene variations.
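The idea of a re-weighted loss can be illustrated with a small sketch. This summary does not give the paper's actual weighting rule, so everything below is an assumption made for illustration: the function name `reweighted_mse`, the parameters `base_weight` and `boost`, and the rule that an instance's weight grows as its pixel area shrinks.

```python
import numpy as np

def reweighted_mse(pred, target, instance_masks, base_weight=1.0, boost=1.0):
    """Weighted mean squared error over a 2D map (hypothetical weighting rule).

    pred, target: float arrays of shape (H, W).
    instance_masks: list of boolean (H, W) arrays, one per object instance.
    Pixels inside an instance get a weight that grows as the instance
    gets smaller; background pixels keep base_weight.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    weights = np.full(pred.shape, base_weight, dtype=float)
    total = pred.size
    for mask in instance_masks:
        area = int(mask.sum())
        if area == 0:
            continue  # empty mask, nothing to emphasize
        # Assumed rule: smaller area -> weight closer to base_weight + boost.
        w = base_weight + boost * (1.0 - area / total)
        weights[mask] = np.maximum(weights[mask], w)
    # Normalize by the total weight so the result stays on the MSE scale.
    return float((weights * (pred - target) ** 2).sum() / weights.sum())
```

With an empty `instance_masks` list the function reduces to plain MSE. An error located inside a small instance contributes more than the same error on the background, which is the behavior the TL;DR describes in words.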

Abstract

Generating multi-view videos for autonomous driving training has recently gained much attention, with the challenge of addressing both cross-view and cross-frame consistency. Existing methods typically apply decoupled attention mechanisms for spatial, temporal, and view dimensions. However, these approaches often struggle to maintain consistency across dimensions, particularly when handling fast-moving objects that appear at different times and viewpoints. In this paper, we present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos. CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions. We also propose a lightweight controller tailored for CogDriving, i.e., Micro-Controller, which uses only 1.1% of the parameters of the standard ControlNet, enabling precise control over Bird's-Eye-View layouts. To enhance the generation of object instances crucial for autonomous driving, we propose a re-weighted learning objective, dynamically adjusting the learning weights for object instances during training. CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos. The project can be found at https://luhannan.github.io/CogDrivingPage/.
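The contrast between decoupled and holistic attention described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the CogDriving implementation: it assumes a token tensor of shape `(V, T, S, D)` (views, frames, spatial tokens, channels), uses single-head attention with no learned projections, and all function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(tokens):
    """Self-attention over the second-to-last axis (no learned projections)."""
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores) @ tokens

def axis_attention(x, axis):
    """Attend along one of the axes V (0), T (1), or S (2) of x: (V, T, S, D)."""
    moved = np.moveaxis(x, axis, -2)  # put the chosen axis in sequence position
    return np.moveaxis(attention(moved), -2, axis)

def decoupled_attention(x):
    """Spatial, then temporal, then view attention, one axis at a time."""
    x = axis_attention(x, 2)
    x = axis_attention(x, 1)
    return axis_attention(x, 0)

def holistic_4d_attention(x):
    """Flatten views, frames, and space into one sequence and attend jointly."""
    V, T, S, D = x.shape
    return attention(x.reshape(V * T * S, D)).reshape(V, T, S, D)
```

In the decoupled variant, a token reaches a token that differs in view, frame, and position only indirectly, through a chain of three single-axis steps. In the holistic variant, every token attends to every other token in one step, at the cost of a sequence of length `V * T * S`.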

Paper Structure

This paper contains 21 sections, 6 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Decoupled Attention [gao2024magicdrive3d] vs. Holistic-4D Attention. Our Holistic-4D attention establishes straightforward cross-dimensional relationships, leading to explicit transmission of visual information and enhanced cross-view consistency.
  • Figure 2: Overview of our CogDriving. (a) depicts the training process of CogDriving, facilitated by the diffusion transformer with Holistic-4D Attention under the condition of BEV layouts. (b) illustrates the detailed architecture of the diffusion transformer, especially the Holistic-4D Attention to achieve the spatial-temporal-perspective mutual interaction. (c) shows the proposed Micro-Controller for the integration of various conditions.
  • Figure 3: Our lightweight Micro-Controller encodes road maps, box IDs, class IDs, and depth maps independently from 3D annotations for precise, geometry-guided synthesis.
  • Figure 4: Consistency analysis. When the same object appears at different times and views, CogDriving maintains cross-view consistency. According to the temporal profile, CogDriving shows superior cross-frame consistency with continuous lines.
  • Figure 5: Generation results of CogDriving. (a) CogDriving synthesizes multi-view driving scene videos conditioned on Bird's-Eye-View (BEV) layout sequences. (b) The model demonstrates strong generalization by generating diverse driving videos, including different weather, seasons, times of day, and extreme scenarios such as thunderstorms.
  • ...and 3 more figures
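
The temporal profile mentioned in the Figure 4 caption can be sketched as follows, assuming the common construction in which one fixed pixel row is taken from every frame and the rows are stacked along time. The paper's exact construction is not given in this summary, and the helper `profile_roughness` is a hypothetical score added only for illustration.

```python
import numpy as np

def temporal_profile(video, row):
    """Stack one fixed pixel row across frames.

    video: array of shape (T, H, W, C).
    Returns an array of shape (T, W, C); a temporally consistent video
    produces smooth, continuous streaks in this image.
    """
    video = np.asarray(video)
    if video.ndim != 4:
        raise ValueError("expected a video of shape (T, H, W, C)")
    return video[:, row, :, :]

def profile_roughness(profile):
    """Mean absolute change between consecutive frames (hypothetical score)."""
    profile = np.asarray(profile, dtype=float)
    if profile.shape[0] < 2:
        raise ValueError("need at least two frames")
    return float(np.abs(np.diff(profile, axis=0)).mean())
```

A static video gives a roughness of zero, while flicker between frames raises it. Broken or jagged lines in the profile image correspond to the cross-frame inconsistencies that the caption contrasts with continuous lines.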