Stag-1: Towards Realistic 4D Driving Simulation with Video Generation Model

Lening Wang, Wenzhao Zheng, Dalong Du, Yunpeng Zhang, Yilong Ren, Han Jiang, Zhiyong Cui, Haiyang Yu, Jie Zhou, Jiwen Lu, Shanghang Zhang

TL;DR

Stag-1 tackles the gap in realistic 4D autonomous driving simulation by decoupling spatial and temporal dynamics and reconstructing 4D point clouds from surround-view data. It combines a two-stage training pipeline with a cross-view diffusion-based video generator to produce controllable, viewpoint-consistent 4D driving scenes. The method achieves improved scene reconstruction, multi-view coherence, and realistic temporal evolution compared to 3D-based baselines. This work enables more rigorous, scalable testing and validation of autonomous driving systems.

Abstract

4D driving simulation is essential for developing realistic autonomous driving simulators. Despite advancements in existing methods for generating driving scenes, significant challenges remain in view transformation and spatial-temporal dynamic modeling. To address these limitations, we propose a Spatial-Temporal simulAtion for drivinG (Stag-1) model to reconstruct real-world scenes and design a controllable generative network to achieve 4D simulation. Stag-1 constructs continuous 4D point cloud scenes using surround-view data from autonomous vehicles. It decouples spatial-temporal relationships and produces coherent keyframe videos. Additionally, Stag-1 leverages video generation models to obtain photo-realistic and controllable 4D driving simulation videos from any perspective. To expand the range of view generation, we train vehicle motion videos based on decomposed camera poses, enhancing modeling capabilities for distant scenes. Furthermore, we reconstruct vehicle camera trajectories to integrate 3D points across consecutive views, enabling comprehensive scene understanding along the temporal dimension. Following extensive multi-level scene training, Stag-1 can simulate from any desired viewpoint and achieve a deep understanding of scene evolution under static spatial-temporal conditions. Compared to existing methods, our approach shows promising performance in multi-view scene consistency, background coherence, and accuracy, and contributes to the ongoing advancements in realistic autonomous driving simulation. Code: https://github.com/wzzheng/Stag.

Paper Structure

This paper contains 12 sections, 13 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Spatial-Temporal simulAtion for drivinG (Stag-1) enables controllable 4D autonomous driving simulation with spatial-temporal decoupling. Stag-1 can decompose the original spatial-temporal relationships of real-world scenes to enable controllable autonomous driving simulation. This allows for adjustments such as fixing the camera viewpoint while advancing time or translating and rotating space while keeping time stationary. Additionally, Stag-1 maintains synchronized variations across six panoramic views.
  • Figure 2: Our Stag-1 framework is a 4D generative model for autonomous driving simulation. It reconstructs 4D scenes from point clouds and projects them into continuous, sparse keyframes. A spatial-temporal fusion framework is then used to generate simulation scenarios. Two key design aspects guide our approach: 1) We develop a method for 4D point cloud matching and keyframe reconstruction, ensuring the accurate generation of continuous, sparse keyframes that account for both vehicle motion and the need for spatial-temporal decoupling in simulation. 2) We build a spatial-temporal fusion framework that integrates surround-view information and continuous scene projection to ensure accurate simulation generation.
  • Figure 3: The Stag-1 training pipeline has two stages. In the time-focused stage, we use the even-indexed keyframes from a single viewpoint to generate a 4D point cloud, which is then projected using the camera parameters of the odd-indexed keyframes to form the conditions; the odd-indexed keyframes themselves serve as training labels. In the spatial-focused stage, surround-view information is incorporated to extract inter-image features from the surrounding viewpoints, followed by training of the spatial-temporal block.
  • Figure 4: Qualitative comparison on the Waymo-NOTR dataset [yang2023emernerf]. Left: novel view synthesis results. Right: dynamic scene reconstruction.
  • Figure 5: Qualitative comparison on the Waymo-Street dataset [yan2024street]. The results show that our method outperforms existing approaches in scene reconstruction.
  • ...and 5 more figures
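
The even/odd keyframe scheme in the Figure 3 caption, together with the point-cloud projection mentioned in Figure 2, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The function names `split_keyframes` and `project_points`, the pinhole camera model, and the assumption that points are already expressed in the target camera's frame are all hypothetical choices made here for illustration.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def split_keyframes(frames: Sequence) -> Tuple[list, list]:
    """Split a keyframe sequence by index parity.

    Even-indexed frames would be used to build the point cloud;
    odd-indexed frames would be held out as training labels.
    (Hypothetical helper, following the Figure 3 caption.)
    """
    return list(frames[0::2]), list(frames[1::2])


def project_points(
    points: Sequence[Point3D],
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> List[Tuple[float, float, float]]:
    """Project camera-frame 3D points with an assumed pinhole model.

    Returns (u, v, depth) for points in front of the camera that land
    inside the image. The result is a sparse rendering that could act
    as a conditioning signal for the held-out frame.
    """
    pixels = []
    for x, y, z in points:
        if z <= 0:  # behind the camera: not visible
            continue
        u = fx * x / z + cx
        v = fy * y / z + cy
        if 0 <= u < width and 0 <= v < height:
            pixels.append((u, v, z))
    return pixels


if __name__ == "__main__":
    cond_frames, label_frames = split_keyframes(["f0", "f1", "f2", "f3", "f4"])
    print(cond_frames, label_frames)
    print(project_points([(0.0, 0.0, 2.0), (1.0, 0.0, -1.0)],
                         fx=100, fy=100, cx=50, cy=50, width=100, height=100))
```

In this sketch the projection is sparse by construction: only pixels hit by a visible point carry a value, which matches the caption's description of "continuous, sparse keyframes" only loosely and under the assumptions stated above.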