Glad: A Streaming Scene Generator for Autonomous Driving
Bin Xie, Yingfei Liu, Tiancai Wang, Jiale Cao, Xiangyu Zhang
TL;DR
Glad addresses the need for diverse, long-duration driving-scene data by generating video online, frame by frame. Temporal coherence comes from Latent Variable Propagation, and efficient training comes from a Streaming Data Sampler. Built on Stable Diffusion with ControlNet-conditioned BEV layouts and CLIP guidance, it can synthesize new scenes or simulate a specific scene from a reference frame, producing videos of arbitrary length. It achieves strong generation quality (low FID/FVD) and improves downstream perception tasks when its output is used for data augmentation. Experiments on nuScenes demonstrate competitive synthesis, robust cross-view consistency, and meaningful gains in 3D detection and HD-map construction, highlighting Glad’s practical value for autonomous-driving simulation and data generation.
Abstract
The generation and simulation of diverse real-world scenes have significant application value in autonomous driving, especially for corner cases. Recently, researchers have explored neural radiance fields and diffusion models to generate novel views or synthetic data for driving scenes. However, these approaches either generalize poorly to unseen scenes or are restricted in video length, and thus lack sufficient adaptability for data generation and simulation. To address these issues, we propose a simple yet effective framework, named Glad, that generates video data frame by frame. To ensure the temporal consistency of the synthetic video, we introduce a latent variable propagation module, which treats the latent features of the previous frame as a noise prior and injects it into the latent features of the current frame. In addition, we design a streaming data sampler that samples the original images of a video clip in order across consecutive iterations. Given a reference frame, Glad can be viewed as a streaming simulator that generates videos for specific scenes. Extensive experiments are performed on the widely used nuScenes dataset. Experimental results demonstrate that Glad achieves promising performance, serving as a strong baseline for online video generation. We will release the source code and models publicly.
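To make the two mechanisms above concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that the streaming data sampler keeps one clip per batch slot and advances it by one frame per iteration, and that latent variable propagation simply starts the current frame from the previous frame's latent (the exact injection scheme is not specified in this abstract). All names (streaming_sampler, initial_latent, placeholder_denoise, run_demo) are hypothetical, and placeholder_denoise is a toy stand-in for the diffusion model.

```python
import random
from typing import Dict, Iterator, List, Optional, Tuple


def streaming_sampler(
    clip_lengths: Dict[str, int],
    batch_slots: int,
    num_iterations: int,
    seed: int = 0,
) -> Iterator[List[Tuple[str, int]]]:
    """Yield one batch per iteration; each slot walks through a clip in frame order."""
    rng = random.Random(seed)
    clip_ids = sorted(clip_lengths)
    # Each slot holds [clip_id, index of the frame to emit next].
    slots = [[rng.choice(clip_ids), 0] for _ in range(batch_slots)]
    for _ in range(num_iterations):
        batch = []
        for slot in slots:
            clip_id, frame_idx = slot
            batch.append((clip_id, frame_idx))
            if frame_idx + 1 < clip_lengths[clip_id]:
                slot[1] = frame_idx + 1  # continue the same clip
            else:
                slot[0], slot[1] = rng.choice(clip_ids), 0  # start a new clip
        yield batch


def initial_latent(
    prev_latent: Optional[List[float]], dim: int, rng: random.Random
) -> List[float]:
    """Starting latent for the current frame.

    Assumption: the first frame of a clip starts from Gaussian noise, and every
    later frame starts from the previous frame's latent (the "noise prior").
    """
    if prev_latent is None:
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return list(prev_latent)


def placeholder_denoise(latent: List[float]) -> List[float]:
    """Toy stand-in for the diffusion model; NOT a real denoiser."""
    return [0.5 * x for x in latent]


def run_demo() -> List[Tuple[str, int, bool]]:
    """Run a few iterations and record which frames used a propagated prior."""
    rng = random.Random(0)
    cache: Dict[int, List[float]] = {}  # slot index -> latent of previous frame
    log = []
    clips = {"clip_a": 3, "clip_b": 2}
    for batch in streaming_sampler(clips, batch_slots=2, num_iterations=4):
        for slot_idx, (clip_id, frame_idx) in enumerate(batch):
            # Reset the prior whenever a slot begins a new clip.
            prev = cache.get(slot_idx) if frame_idx > 0 else None
            start = initial_latent(prev, dim=4, rng=rng)
            cache[slot_idx] = placeholder_denoise(start)
            log.append((clip_id, frame_idx, prev is not None))
    return log
```

In this sketch the cache is keyed by batch slot, so consecutive iterations see consecutive frames of the same clip and the latent carried over always belongs to the immediately preceding frame. A real training loop would replace placeholder_denoise with the conditioned diffusion model and would likely add noise to the propagated latent, which is an assumption this sketch does not model.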
