CoGen: 3D Consistent Video Generation via Adaptive Conditioning for Autonomous Driving

Yishen Ji, Ziyue Zhu, Zhenxin Zhu, Kaixin Xiong, Ming Lu, Zhiqi Li, Lijun Zhou, Haiyang Sun, Bing Wang, Tong Lu

TL;DR

CoGen tackles the challenge of generating photorealistic, 3D-consistent driving videos by replacing 2D conditioning with temporally coherent 3D semantics. It introduces a temporal 3D semantics generator, a 3D geometry-aware diffusion transformer, and a lightweight Consistency Adapter that fuses multiple conditions while preserving 3D structure. Ray-cast projections of the 3D semantics yield four conditioning maps (Semantic Map, Depth Map, Coordinate Map, MPI), and a foreground-aware loss strengthens supervision on foreground objects; together these improve 3D coherence and visual fidelity. Evaluations on nuScenes show state-of-the-art FVD/FID and strong downstream performance, demonstrating the practical value of high-fidelity synthetic data for autonomous driving tasks.
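As a rough illustration of how a ray-cast projection can turn a 3D semantics grid into a per-camera Semantic Map and Depth Map, the NumPy sketch below marches one ray per pixel through a voxel grid and records the first occupied voxel it meets. It is not the authors' implementation: the function name `raycast_semantics`, the fixed-step marching scheme, the grid layout (label 0 meaning free space), and every parameter are assumptions made for this example.

```python
import numpy as np

def raycast_semantics(grid, voxel_size, origin, K, cam_to_world, hw, max_depth, step):
    """March one ray per pixel through a semantic voxel grid (hypothetical helper).

    grid:         (X, Y, Z) integer labels, 0 = free space (assumed convention).
    voxel_size:   edge length of one voxel in world units.
    origin:       world coordinates of the grid corner at index (0, 0, 0).
    K:            3x3 pinhole intrinsics.
    cam_to_world: 4x4 camera pose.
    Returns a semantic map and a depth map, both (H, W); depth is 0 where no voxel is hit.
    """
    H, W = hw
    origin = np.asarray(origin, dtype=np.float64)
    # Pixel centres in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)            # (H, W, 3)
    # Back-project to camera-frame directions with z = 1, so the ray
    # parameter d below equals depth along the optical axis.
    dirs_cam = pix @ np.linalg.inv(K).T
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T

    sem = np.zeros((H, W), dtype=grid.dtype)
    depth = np.zeros((H, W), dtype=np.float64)
    hit = np.zeros((H, W), dtype=bool)
    shape = np.array(grid.shape)

    for d in np.arange(step, max_depth + 1e-9, step):
        pts = t + d * dirs_world                                 # (H, W, 3) sample points
        idx = np.floor((pts - origin) / voxel_size).astype(int)
        inside = np.all((idx >= 0) & (idx < shape), axis=-1)
        labels = np.zeros((H, W), dtype=grid.dtype)
        labels[inside] = grid[idx[inside, 0], idx[inside, 1], idx[inside, 2]]
        new = (~hit) & (labels > 0)                              # first occupied voxel per ray
        sem[new] = labels[new]
        depth[new] = d
        hit |= new
    return sem, depth
```

A Coordinate Map could be obtained in the same loop by also storing `pts[new]` at the first hit; a production implementation would more likely use exact voxel traversal rather than fixed-step sampling, which can skip thin structures when `step` is larger than a voxel.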

Abstract

Recent progress in driving video generation has shown significant potential for enhancing self-driving systems by providing scalable and controllable training data. Although pretrained state-of-the-art generation models, guided by 2D layout conditions (e.g., HD maps and bounding boxes), can produce photorealistic driving videos, achieving controllable multi-view videos with high 3D consistency remains a major challenge. To tackle this, we introduce a novel spatial adaptive generation framework, CoGen, which leverages advances in 3D generation to improve performance in two key aspects: (i) To ensure 3D consistency, we first generate high-quality, controllable 3D conditions that capture the geometry of driving scenes. By replacing coarse 2D conditions with these fine-grained 3D representations, our approach significantly enhances the spatial consistency of the generated videos. (ii) Additionally, we introduce a consistency adapter module to strengthen the robustness of the model to multi-condition control. The results demonstrate that this method excels in preserving geometric fidelity and visual realism, offering a reliable video generation solution for autonomous driving.

Paper Structure

This paper contains 24 sections, 6 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview of our model. (a) Training and inference pipeline. Using BEV maps as conditions, we generate temporal 3D semantics sequences, which are then projected and encoded to provide guidance for video generation. During projection, a foreground object mask is created and incorporated into training through foreground-mask loss reweighting, which strengthens supervision of foreground generation quality. (b) Details of 3D semantics projection and encoding. The different forms of guidance are fused through $1\times1$ convolutions. (c) Illustration of our diffusion transformer architecture.
  • Figure 2: Visualization of the 3D semantics conditions used for video generation. Each condition is derived by projecting the 3D semantics grid into the camera view using ray casting, capturing essential geometric and semantic information for enhanced video generation.
  • Figure 3: Architecture of the Consistency Adapter. Here, $c$ represents the control conditions output from the control block, and $c'$ denotes the adapter’s output, which replaces $c$ and is integrated into the base block.
  • Figure 4: An example of our generated long driving video. The content in the red boxes shows that adjacent viewpoints maintain the same appearance. The yellow arrows show that the generated images remain consistent across time steps.
  • Figure 5: Qualitative comparison with MagicDrive. Our approach exhibits enhanced spatial-temporal consistency and finer details, particularly noticeable in the foreground objects and scene geometry.
  • ...and 2 more figures
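
The Figure 1 caption mentions two mechanisms that are easy to picture in code: fusing several guidance maps with a $1\times1$ convolution, and reweighting the training loss with a foreground mask. The NumPy sketch below shows one plausible reading of each. The function names, the `fg_weight` value, the use of a mean-squared error, and the normalization by the sum of weights are all assumptions for illustration; the captions do not specify these details.

```python
import numpy as np

def fuse_conditions_1x1(cond_maps, weight, bias):
    """Fuse guidance maps with a 1x1 convolution (hypothetical helper).

    cond_maps: list of arrays shaped (C_i, H, W), e.g. semantic and depth encodings.
    weight:    (C_out, sum of C_i) mixing matrix; bias: (C_out,).
    A 1x1 convolution is a per-pixel linear map over the channel axis.
    """
    x = np.concatenate(cond_maps, axis=0)                        # (C_in, H, W)
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def foreground_weighted_mse(pred, target, fg_mask, fg_weight=2.0):
    """Squared error with foreground pixels up-weighted (assumed form of the reweighting).

    fg_mask:   boolean array broadcastable to pred's shape, True on foreground pixels.
    fg_weight: assumed multiplier for foreground pixels; background pixels use 1.0.
    """
    w = np.broadcast_to(np.where(fg_mask, fg_weight, 1.0), pred.shape)
    return float((w * (pred - target) ** 2).sum() / w.sum())
```

With `fg_weight` above 1, an error on a foreground pixel contributes more to the loss than the same error on the background, which is the general effect the caption describes as enhanced supervision for foreground objects.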