CoGen: 3D Consistent Video Generation via Adaptive Conditioning for Autonomous Driving
Yishen Ji, Ziyue Zhu, Zhenxin Zhu, Kaixin Xiong, Ming Lu, Zhiqi Li, Lijun Zhou, Haiyang Sun, Bing Wang, Tong Lu
TL;DR
CoGen targets photorealistic, 3D-consistent driving video generation by replacing coarse 2D layout conditions with temporally coherent 3D semantics. It combines a temporal 3D semantics generator, a 3D geometry-aware diffusion transformer, and a lightweight Consistency Adapter that fuses multiple conditions while preserving 3D structure. Ray-cast projections of the 3D semantics yield four conditioning maps (Semantic Map, Depth Map, Coordinate Map, MPI), and together with a foreground-aware loss these improve 3D coherence and visual fidelity. On nuScenes, the method reports state-of-the-art FVD/FID and strong downstream performance, supporting the practical value of high-fidelity synthetic data for autonomous driving tasks.
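To make the ray-cast projection idea concrete, the following is a minimal sketch of how a semantic voxel grid could be rendered into per-camera Semantic, Depth, and Coordinate maps. It is not the authors' implementation: the function name `raycast_conditions`, the grid layout, the pinhole intrinsics, and the fixed-step ray marching (used here in place of an exact voxel traversal) are all assumptions made for illustration, and the MPI condition is not covered.

```python
import numpy as np

def raycast_conditions(occ, voxel_size, grid_min, K, cam_to_world, hw,
                       max_depth=40.0, step=0.2):
    """Ray-march a semantic voxel grid from one pinhole camera (illustrative only).

    occ:          (X, Y, Z) int array, 0 = free space, >0 = semantic class id.
    grid_min:     world coordinate of the grid's minimum corner, shape (3,).
    K:            3x3 camera intrinsics; cam_to_world: 4x4 camera pose.
    Returns semantic (H, W), z-depth (H, W), world coords (H, W, 3).
    Pixels whose ray hits nothing keep the value 0 in all three maps.
    """
    h, w = hw
    # Pixel centers -> camera-frame ray directions normalized so that z = 1.
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # (H, W, 3)
    rays_cam = pix @ np.linalg.inv(K).T                    # (H, W, 3)
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]

    semantic = np.zeros((h, w), dtype=occ.dtype)
    depth = np.zeros((h, w), dtype=np.float64)
    coords = np.zeros((h, w, 3), dtype=np.float64)
    hit = np.zeros((h, w), dtype=bool)
    grid_shape = np.array(occ.shape)

    n_steps = int(round(max_depth / step))
    for d in step * np.arange(1, n_steps + 1):
        pts = (rays_cam * d) @ R.T + t                     # world points at z-depth d
        idx = np.floor((pts - grid_min) / voxel_size).astype(int)
        inside = np.all((idx >= 0) & (idx < grid_shape), axis=-1)
        labels = np.zeros((h, w), dtype=occ.dtype)
        ii = idx[inside]                                   # (N, 3) valid voxel indices
        labels[inside] = occ[ii[:, 0], ii[:, 1], ii[:, 2]]
        new = (~hit) & (labels > 0)                        # first occupied voxel per ray
        semantic[new] = labels[new]
        depth[new] = d
        coords[new] = pts[new]
        hit |= new
    return semantic, depth, coords

# Toy scene (hypothetical): a flat wall of class 3 about 10 m in front of the camera.
occ = np.zeros((20, 20, 20), dtype=np.int64)
occ[:, :, 10] = 3
K = np.array([[50.0, 0.0, 16.0], [0.0, 50.0, 12.0], [0.0, 0.0, 1.0]])
semantic, depth, coords = raycast_conditions(
    occ, voxel_size=1.0, grid_min=np.array([-10.0, -10.0, 0.0]),
    K=K, cam_to_world=np.eye(4), hw=(24, 32))
print(semantic.shape, depth.shape, coords.shape)
```

Because all three maps come from the same ray hit, they agree pixel by pixel on which surface is visible. Rendering every camera from one shared 3D scene in this way is what lets the conditions stay consistent across views, in contrast to independently drawn 2D layouts.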
Abstract
Recent progress in driving video generation has shown significant potential for enhancing self-driving systems by providing scalable and controllable training data. Although pretrained state-of-the-art generation models, guided by 2D layout conditions (e.g., HD maps and bounding boxes), can produce photorealistic driving videos, achieving controllable multi-view videos with high 3D consistency remains a major challenge. To tackle this, we introduce a novel spatially adaptive generation framework, CoGen, which leverages advances in 3D generation to improve performance in two key aspects. (i) To ensure 3D consistency, we first generate high-quality, controllable 3D conditions that capture the geometry of driving scenes. By replacing coarse 2D conditions with these fine-grained 3D representations, our approach significantly enhances the spatial consistency of the generated videos. (ii) We introduce a consistency adapter module to strengthen the robustness of the model to multi-condition control. The results demonstrate that this method excels in preserving geometric fidelity and visual realism, offering a reliable video generation solution for autonomous driving.
