Glad: A Streaming Scene Generator for Autonomous Driving

Bin Xie, Yingfei Liu, Tiancai Wang, Jiale Cao, Xiangyu Zhang

TL;DR

Glad addresses the need for diverse, long-duration driving-scene data by generating video online, frame by frame. Latent Variable Propagation keeps the frames temporally coherent, and a Streaming Data Sampler makes training efficient. Built on Stable Diffusion with ControlNet-conditioned BEV layouts and CLIP guidance, it can synthesize new scenes or simulate specific ones from a reference frame, producing videos of arbitrary length. Together, the two core components deliver temporal consistency and data efficiency, with strong generation quality (low FID/FVD) and measurable improvements in downstream perception tasks when the generated data is used for augmentation. Experiments on nuScenes demonstrate competitive synthesis, robust cross-view consistency, and gains in 3D detection and HD-map construction, highlighting Glad’s practical value for autonomous-driving simulation and data generation.

Abstract

The generation and simulation of diverse real-world scenes have significant application value in autonomous driving, especially for corner cases. Recently, researchers have explored neural radiance fields and diffusion models for generating novel views or synthetic data of driving scenes. However, these approaches struggle with unseen scenes or are restricted in video length, and thus lack sufficient adaptability for data generation and simulation. To address these issues, we propose a simple yet effective framework, named Glad, that generates video data frame by frame. To ensure the temporal consistency of the synthetic video, we introduce a latent variable propagation module, which treats the latent features of the previous frame as a noise prior and injects them into the latent features of the current frame. In addition, we design a streaming data sampler that samples the original images of a video clip in order over consecutive iterations. Given a reference frame, Glad can be viewed as a streaming simulator that generates videos for specific scenes. Extensive experiments are performed on the widely used nuScenes dataset. Experimental results demonstrate that Glad achieves promising performance, serving as a strong baseline for online video generation. We will release the source code and models publicly.
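
The frame-by-frame idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for the diffusion model, the layouts are plain lists of floats, and the noising step assumes the standard DDPM forward form with a hypothetical `alpha_bar` parameter. The only point shown is the control flow: each frame starts from a noised copy of the previous frame's denoised latent, so the generator can run for any number of frames.

```python
import math
import random


def add_noise(latent, alpha_bar, rng):
    """Noise a latent as sqrt(a) * x + sqrt(1 - a) * eps (assumed DDPM-style form)."""
    keep = math.sqrt(alpha_bar)
    spread = math.sqrt(1.0 - alpha_bar)
    return [keep * x + spread * rng.gauss(0.0, 1.0) for x in latent]


def toy_denoiser(start_latent, layout):
    """Hypothetical stand-in for the diffusion model: blend latent and layout."""
    return [0.5 * z + 0.5 * c for z, c in zip(start_latent, layout)]


def generate_stream(layouts, dim, alpha_bar=0.3, seed=0, reference_latent=None):
    """Yield one denoised latent per layout, propagating latents across frames."""
    rng = random.Random(seed)
    previous = reference_latent  # None means "new scene from pure noise"
    for layout in layouts:
        if previous is None:
            # First frame of a new scene: start from Gaussian noise.
            start = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        else:
            # Later frames: the previous denoised latent acts as the noise prior.
            start = add_noise(previous, alpha_bar, rng)
        previous = toy_denoiser(start, layout)
        yield previous


if __name__ == "__main__":
    layouts = [[1.0] * 4 for _ in range(5)]
    for index, latent in enumerate(generate_stream(layouts, dim=4)):
        print(index, [round(v, 3) for v in latent])
```

Passing `reference_latent` corresponds to the simulation mode described above, where generation starts from a given frame rather than from noise; because `generate_stream` is a generator, the clip length is set only by how many layouts are supplied.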

Paper Structure

This paper contains 18 sections, 4 equations, 9 figures, 15 tables.

Figures (9)

  • Figure 1: Comparison of Unisim, Panacea, and our proposed Glad: (a) The NeRF-based Unisim yang2023unisim struggles to render unseen scenes and objects when the simulated ego trajectory deviates from the source data. (b) The diffusion-based Panacea wen2023panacea generates fixed-length video data in an offline manner. It suffers from relatively high memory consumption and lacks the adaptability to accommodate variations in a dynamic simulated trajectory. (c) Our Glad is designed for frame-by-frame generation, which enables videos of arbitrary length and gives good flexibility under variations in the simulated trajectory.
  • Figure 2: Overall architecture of our proposed Glad. Glad is based on Stable Diffusion and can take random noise or a reference frame as input to generate new or specific scenes. Glad then generates the video sequence in order, from frame 1 to frame N. We employ the proposed latent variable propagation module to feed the denoised latent features of the previous frame to the current frame as the noise prior, which maintains temporal consistency across the video. This frame-by-frame generation strategy enables videos of arbitrary length. In addition, we employ ControlNet zhang2023adding to introduce the BEV layout for fine-grained control over data generation.
  • Figure 3: Illustration of the streaming data sampler. The streaming data sampler samples a video clip from frame 1 to frame $N$ over consecutive iterations. At each iteration, we save the denoised latent features generated by the diffusion model in a cache and reuse them as the noise at the next iteration.
  • Figure 4: Detection performance on simulated data relative to real data as the number of simulated frames increases. It decreases slightly over the first two frames and then stabilizes.
  • Figure 5: Visualization of data generation examples. We generate a video clip by feeding Gaussian noise to Glad, with BEV layout sequences starting at index 2208 of the nuScenes validation set.
  • ...and 4 more figures
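
The sampling scheme described in the Figure 3 caption can be sketched as follows. This is an illustrative assumption, not the released code: `streaming_sampler`, `run_streaming`, and `toy_step` are hypothetical names, frames are plain strings, and `toy_step` stands in for one training iteration of the diffusion model. The sketch shows only the ordering and caching behavior: frames of a clip arrive in order over consecutive iterations, and the latent cached at one iteration becomes the prior at the next.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple


def streaming_sampler(clips: Dict[str, List[str]]) -> Iterator[Tuple[str, int, str]]:
    """Yield (clip_id, frame_index, frame) with each clip's frames in order."""
    for clip_id, frames in clips.items():
        for index, frame in enumerate(frames):
            yield clip_id, index, frame


def run_streaming(
    clips: Dict[str, List[str]],
    step: Callable[[str, Optional[str]], str],
) -> List[Tuple[str, Optional[str]]]:
    """Run one pass, caching each latent for reuse at the next iteration."""
    cache: Dict[str, str] = {}
    log: List[Tuple[str, Optional[str]]] = []
    for clip_id, index, frame in streaming_sampler(clips):
        # The first frame of a clip has no prior; later frames reuse the cache.
        prior = cache.get(clip_id) if index > 0 else None
        cache[clip_id] = step(frame, prior)
        log.append((frame, prior))
    return log


def toy_step(frame: str, prior: Optional[str]) -> str:
    """Hypothetical training step: returns a label for the denoised latent."""
    return "z(" + frame + ")"


if __name__ == "__main__":
    clips = {"a": ["a1", "a2", "a3"], "b": ["b1", "b2"]}
    for frame, prior in run_streaming(clips, toy_step):
        print(frame, "<-", prior)
```

Under this arrangement each iteration processes a single frame, and only one cached latent per clip is kept, so earlier frames never need to be recomputed.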