Panacea+: Panoramic and Controllable Video Generation for Autonomous Driving
Yuqing Wen, Yucheng Zhao, Yingfei Liu, Binyuan Huang, Fan Jia, Yanhui Wang, Chi Zhang, Tiancai Wang, Xiaoyan Sun, Xiangyu Zhang
TL;DR
Panacea+ addresses the scarcity of high-quality, temporally consistent annotated driving videos by extending Panacea with a two-stage, multi-view diffusion framework. It combines decomposed 4D attention, a multi-view appearance noise prior, a BEV-layout and text conditioning pathway (via ControlNet and CLIP), and a post-generation super-resolution module to produce high-resolution, annotated driving sequences. On nuScenes and Argoverse 2, Panacea+ reports state-of-the-art generation metrics (FVD/FID), strong controllability aligned with BEV layouts, and substantial improvements in downstream tasks such as 3D object tracking, 3D object detection, and lane detection when synthetic data augments real data. The framework enriches training data for autonomous-driving perception with temporally consistent, high-resolution video, and the authors outline directions for reducing computational cost and scaling diffusion architectures.
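The summary above names a multi-view appearance noise prior but does not say how it is constructed. As a rough illustration of the general idea of a noise prior that is correlated across frames, the sketch below mixes one shared Gaussian base per view with fresh per-frame noise. This is a generic mixed-noise construction, not Panacea+'s formulation: the function name `correlated_noise`, the mixing weight `alpha`, and the choice to share noise per view are all assumptions made for illustration, and the sketch does not model any appearance information taken from images.

```python
import math
import random


def correlated_noise(num_frames, num_views, size, alpha, seed=0):
    """Sample noise[frame][view] as a list of `size` floats.

    Illustrative assumption (not from the paper): each view has one base
    noise vector shared by all of its frames, mixed with fresh per-frame
    noise. `alpha` in [0, 1] is the shared fraction of the variance.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    rng = random.Random(seed)
    w_shared = math.sqrt(alpha)
    w_indep = math.sqrt(1.0 - alpha)

    # One shared base vector per view, reused for every frame of that view.
    shared = [[rng.gauss(0.0, 1.0) for _ in range(size)]
              for _ in range(num_views)]

    noise = []
    for _ in range(num_frames):
        frame = []
        for v in range(num_views):
            # w_shared**2 + w_indep**2 == 1, so each element keeps unit
            # variance, as a standard diffusion sampler expects.
            frame.append([w_shared * s + w_indep * rng.gauss(0.0, 1.0)
                          for s in shared[v]])
        noise.append(frame)
    return noise
```

With `alpha = 0` every frame starts from independent noise. With `alpha = 1` all frames of a given view start from the same noise vector. Values in between trade per-frame diversity against correlation across time.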
Abstract
The field of autonomous driving increasingly demands high-quality annotated video training data. In this paper, we propose Panacea+, a powerful and broadly applicable framework for generating video data in driving scenes. Building on our previous work, Panacea, Panacea+ adopts a multi-view appearance noise prior mechanism and a super-resolution module for enhanced consistency and increased resolution. Extensive experiments show that video samples generated by Panacea+ greatly benefit a wide range of tasks on different datasets, including 3D object tracking, 3D object detection, and lane detection on the nuScenes and Argoverse 2 datasets. These results demonstrate that Panacea+ is a valuable data generation framework for autonomous driving.
