WoVoGen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation

Jiachen Lu; Ze Huang; Zeyu Yang; Jiahui Zhang; Li Zhang

TL;DR

WoVoGen addresses the challenge of generating coherent multi-camera driving scenes by introducing an explicit 4D world volume as dense conditioning for diffusion-based synthesis. The method operates in two phases: it first envisions a future 4D world volume from past world volumes and ego-vehicle actions, then generates synchronized multi-camera videos guided by that volume, using CLIP-based world features, object-guided conditioning, and temporal attention for consistency. Key contributions include the 4D world volume formulation, a two-branch architecture (a world model branch plus a world volume-aware synthesis branch), and demonstrated improvements in cross-view and temporal coherence, along with scene editing capabilities, on nuScenes. The framework serves as a controllable data-generation tool for autonomous driving research and dataset augmentation, with control over weather and location and editable scene content.

Abstract

Generating multi-camera street-view videos is critical for augmenting autonomous driving datasets, addressing the urgent demand for extensive and varied data. Because traditional rendering-based methods are limited in diversity and struggle with lighting conditions, they are increasingly being supplanted by diffusion-based methods. However, a significant challenge in diffusion-based methods is ensuring that the generated sensor data preserve both intra-world consistency and inter-sensor coherence. To address these challenges, we incorporate an explicit world volume and propose the World Volume-aware Multi-camera Driving Scene Generator (WoVoGen). This system is specifically designed to leverage a 4D world volume as a foundational element for video generation. Our model operates in two distinct phases: (i) envisioning the future 4D temporal world volume based on vehicle control sequences, and (ii) generating multi-camera videos, informed by this envisioned 4D temporal world volume and sensor interconnectivity. The incorporation of the 4D world volume empowers WoVoGen not only to generate high-quality street-view videos in response to vehicle control inputs but also to facilitate scene editing tasks.
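The two phases described above can be read as a simple data flow: past world volumes and ego actions go in, future world volumes come out, and every camera view at a given time step is then conditioned on the same volume. The following is a minimal sketch of that interface only, not the authors' implementation. The function names, tensor layouts, and the placeholder "dynamics" and "rendering" (a shift plus noise, and a shared mean plus noise) are all assumptions made for illustration; in WoVoGen both phases are diffusion models.

```python
import numpy as np


def envision_world_volume(past_volumes, ego_actions, rng):
    """Phase (i) stand-in (hypothetical): predict one future volume per ego action.

    past_volumes: array of shape (T_past, C, Z, Y, X)
    ego_actions:  array of shape (T_future, A)
    returns:      array of shape (T_future, C, Z, Y, X)
    """
    current = past_volumes[-1]
    future = []
    for action in ego_actions:
        # Placeholder dynamics (assumption): shift the volume along X by the
        # first action component and add a little noise.
        shift = int(round(float(action[0])))
        current = np.roll(current, -shift, axis=-1)
        current = current + 0.01 * rng.standard_normal(current.shape)
        future.append(current)
    return np.stack(future)


def generate_multicamera_video(world_volumes, num_cameras, image_hw, rng):
    """Phase (ii) stand-in (hypothetical): all cameras at time t share one volume.

    world_volumes: array of shape (T, C, Z, Y, X)
    returns:       array of shape (T, num_cameras, 3, H, W)
    """
    num_steps = world_volumes.shape[0]
    height, width = image_hw
    video = np.empty((num_steps, num_cameras, 3, height, width))
    for t in range(num_steps):
        # Placeholder conditioning (assumption): a single statistic of the
        # volume stands in for the shared scene content at time t.
        shared_scene = world_volumes[t].mean()
        for cam in range(num_cameras):
            noise = 0.1 * rng.standard_normal((3, height, width))
            video[t, cam] = shared_scene + noise
    return video
```

Calling `envision_world_volume` with three actions and then `generate_multicamera_video` with `num_cameras=6` (the nuScenes camera count) yields a video tensor with three time steps and six views per step. The point of the sketch is the dependency structure: because every view at time `t` reads the same world volume, cross-view consistency is tied to the conditioning rather than left to each camera's generator.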

Paper Structure (30 sections, 13 equations, 14 figures, 3 tables)

This paper contains 30 sections, 13 equations, 14 figures, 3 tables.

Introduction
Related work
Method
Preliminary
Overall architecture
World volume
World model
World volume-aware 2D feature
World volume-aware diffusion generation
Video generation
Experiments
Experimental setup
Results
4D World volume generation
Multi-camera single-frame image generation
...and 15 more sections

Figures (14)

Figure 1: WoVoGen generates future world volumes (i.e., HD maps and occupancy) and high-quality multi-camera street-view images from past world volumes. The bottom row shows weather-based control: given the predicted world volume at time $t_4$ and a weather description, the multi-camera images transition from rainy to sunny conditions while maintaining the street layout.
Figure 2: Overall framework of WoVoGen. Top: world model branch. We finetune the AutoencoderKL and train the 4D diffusion model from scratch to generate future world volumes based on past world volumes and the actions of the ego car. Bottom: world volume-aware synthesis branch. Leveraging the generated future volumes as input, $\mathcal{F}_w$ are derived through the world encoder. Subsequent sampling yields $\mathcal{F}_{img}$, which are then aggregated. The process is finalized by applying panoptic diffusion to produce future videos.
Figure 3: (a): an action attention block enhances the model by incorporating action information. (b): a guidance attention block integrates the CLIP feature of a specific object into the latent representation.
Figure 4: WoVoGen excels in producing future world volumes (top two rows) with temporal consistency. Subsequently, it utilizes the world volume-aware 2D image features derived from the world model's outputs to synthesize a driving video (bottom two rows) with both multi-camera consistency and temporal consistency.
Figure 5: Examples of conditional generation on the nuScenes (Caesar et al., 2020) validation set. WoVoGen enables diverse and controllable scene generation. Altering the random seed produces different scenarios. Additionally, adjusting the weather (such as sunny, rainy, night, etc.) and location (Singapore, Boston, etc.) in the prompt changes the weather conditions and city style of the generated scene.
...and 9 more figures
