Panacea+: Panoramic and Controllable Video Generation for Autonomous Driving

Yuqing Wen, Yucheng Zhao, Yingfei Liu, Binyuan Huang, Fan Jia, Yanhui Wang, Chi Zhang, Tiancai Wang, Xiaoyan Sun, Xiangyu Zhang

TL;DR

Panacea+ addresses the scarcity of high-quality, temporally consistent, annotated multi-view driving videos by extending Panacea with a two-stage, multi-view diffusion framework. It introduces decomposed 4D attention, a multi-view appearance noise prior, a BEV-layout and text conditioning pathway via ControlNet/CLIP, and a post-generation super-resolution module to produce high-resolution, annotated driving sequences. Across nuScenes and Argoverse 2, Panacea+ yields state-of-the-art generation metrics (FVD/FID), strong controllability aligned with BEV layouts, and substantial improvements in downstream tasks such as 3D object tracking, 3D object detection, and lane detection when synthetic data augment real data. The framework enriches training data for autonomous-driving perception with better temporal consistency and higher-resolution synthesis, and the paper outlines directions for reducing computational cost and scaling diffusion architectures.

Abstract

The field of autonomous driving increasingly demands high-quality annotated video training data. In this paper, we propose Panacea+, a powerful and universally applicable framework for generating video data in driving scenes. Built upon the foundation of our previous work, Panacea, Panacea+ adopts a multi-view appearance noise prior mechanism and a super-resolution module for enhanced consistency and increased resolution. Extensive experiments show that the generated video samples from Panacea+ greatly benefit a wide range of tasks on different datasets, including 3D object tracking, 3D object detection, and lane detection on the nuScenes and Argoverse 2 datasets. These results strongly prove Panacea+ to be a valuable data generation framework for autonomous driving.

Paper Structure (23 sections, 8 equations, 7 figures, 8 tables)

This paper contains 23 sections, 8 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Framework outline of Panacea+. Panacea+ is built upon Panacea while introducing an appearance noise prior and a super-resolution module. Here "D" denotes the VAE decoder.
  • Figure 2: Visualizations of Panacea+'s capabilities. Panacea+ is able to generate high-quality, high-resolution multi-view videos with BEV layout control and textual control. The generated samples can greatly benefit many autonomous-driving tasks on different datasets, including object tracking, object detection, and lane detection, verifying its power and versatility for autonomous driving.
  • Figure 3: Overview of Panacea+. (a) The diffusion training process of Panacea+, enabled by a diffusion encoder and decoder with the decomposed 4D attention module. An appearance noise prior is applied to enhance temporal consistency. (b) The decomposed 4D attention module comprises three components: intra-view attention for spatial processing within individual views, cross-view attention to engage with adjacent views, and cross-frame attention for temporal processing. (c) Controllable module for the integration of diverse signals. The image conditions are derived from a frozen VAE encoder and combined with diffused noise. The text prompts are processed through a frozen CLIP encoder, while BEV sequences are handled via ControlNet. (d) The details of BEV layout sequences, including projected bounding boxes, object depths, road maps, and camera poses.
  • Figure 4: The two-stage inference pipeline of Panacea+. The first stage creates multi-view images from BEV layouts; the second stage uses these images, along with subsequent BEV layouts, to generate the following frames. A super-resolution module is then appended to increase the resolution.
  • Figure 5: Videos generated by Panacea+ on the nuScenes dataset.
  • ...and 2 more figures
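
The decomposition described in the Figure 3(b) caption can be illustrated by listing which tokens a single query token attends to under each of the three attention types, compared with full joint attention over views, frames, and spatial positions. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function name `attention_context`, the `(view, frame, position)` token indexing, the circular left/right neighbour rule for "adjacent views", and the choice that cross-frame attention links the same spatial position across frames are all assumptions made for this example.

```python
from typing import List, Tuple

# A token is identified by (view index, frame index, spatial position index).
Token = Tuple[int, int, int]


def attention_context(kind: str, token: Token,
                      n_views: int, n_frames: int, n_pos: int) -> List[Token]:
    """Return the tokens that `token` attends to under one attention type.

    Hypothetical helper for illustration only; the adjacency and
    cross-frame rules are assumptions, not taken from the paper.
    """
    v, f, p = token
    if kind == "intra_view":
        # Spatial attention inside one view at one frame.
        return [(v, f, q) for q in range(n_pos)]
    if kind == "cross_view":
        # Assumed: cameras form a ring, so each view has a left and a
        # right neighbour at the same frame.
        neighbours = sorted({(v - 1) % n_views, (v + 1) % n_views} - {v})
        return [(u, f, q) for u in neighbours for q in range(n_pos)]
    if kind == "cross_frame":
        # Assumed: temporal attention over the same view and the same
        # spatial position in every frame.
        return [(v, g, p) for g in range(n_frames)]
    if kind == "full_4d":
        # Joint attention over every view, frame, and position.
        return [(u, g, q)
                for u in range(n_views)
                for g in range(n_frames)
                for q in range(n_pos)]
    raise ValueError(f"unknown attention kind: {kind}")


if __name__ == "__main__":
    # Toy sizes chosen for readability, not taken from the paper.
    V, F, P = 6, 8, 4
    query = (0, 3, 2)
    for kind in ("intra_view", "cross_view", "cross_frame", "full_4d"):
        keys = attention_context(kind, query, V, F, P)
        print(f"{kind:12s} attends to {len(keys)} tokens")
```

Under these assumptions, each decomposed step compares a query against far fewer keys than joint attention would, and stacking the three steps still lets information flow across positions, neighbouring views, and frames.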