CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving

Tianrui Zhang, Yichen Liu, Zilin Guo, Yuxin Guo, Jingcheng Ni, Chenjing Ding, Dan Xu, Lewei Lu, Zehuan Wu

TL;DR

CVD-STORM tackles the challenge of learning driving world models that support long-term, multi-view video generation and accurate 4D reconstruction. It jointly trains STORM-VAE, a VAE augmented with a Gaussian Splatting decoder, and a cross-view diffusion backbone to produce six-view videos conditioned on text, bounding boxes, and HD maps, while the Gaussian Splatting decoder reconstructs the 4D scene. The approach uses a single-stage training regime and a rectified flow loss, achieving substantial improvements in FID and FVD on nuScenes and enabling coherent sequences of up to 20 seconds with depth-aware reconstruction. This work advances autonomous-driving world models by integrating high-fidelity generation with explicit 4D geometry, yielding both improved visual realism and richer scene understanding for planning and simulation.
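As a rough illustration of what a rectified flow loss computes, the sketch below shows the generic objective on a toy vector; it is not the authors' implementation. It assumes the common convention in which t = 0 is the clean latent and t = 1 is pure noise, so the regression target is the constant velocity `noise - data`. The names `rectified_flow_loss` and `velocity_fn` are hypothetical, and the conditioning inputs (text, bounding boxes, HD map) are omitted.

```python
def rectified_flow_loss(velocity_fn, x_data, x_noise, t):
    """Mean squared error between a predicted velocity and the straight-line target.

    Assumed convention (not taken from the paper): t=0 is data, t=1 is noise.
    velocity_fn is a hypothetical stand-in for the diffusion backbone.
    """
    # Point on the straight line between the data sample and the noise sample.
    x_t = [(1.0 - t) * d + t * n for d, n in zip(x_data, x_noise)]
    # Along a straight path the velocity is constant: noise minus data.
    target = [n - d for d, n in zip(x_data, x_noise)]
    pred = velocity_fn(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


# Toy usage: a "model" that always predicts zero velocity.
x_data = [1.0, 2.0]
x_noise = [3.0, 0.0]
loss = rectified_flow_loss(lambda x_t, t: [0.0, 0.0], x_data, x_noise, 0.5)
print(loss)  # 4.0
```

In actual training, t would be sampled at random for each example and `velocity_fn` would be the learned network evaluated on latents rather than a two-element list.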

Abstract

Generative models have been widely applied to world modeling for environment simulation and future state prediction. With advancements in autonomous driving, there is a growing demand not only for high-fidelity video generation under various controls, but also for producing diverse and meaningful information such as depth estimation. To address this, we propose CVD-STORM, a cross-view video diffusion model utilizing a spatial-temporal reconstruction Variational Autoencoder (VAE) that generates long-term, multi-view videos with 4D reconstruction capabilities under various control inputs. Our approach first fine-tunes the VAE with an auxiliary 4D reconstruction task, enhancing its ability to encode 3D structures and temporal dynamics. Subsequently, we integrate this VAE into the video diffusion process to significantly improve generation quality. Experimental results demonstrate that our model achieves substantial improvements in both FID and FVD metrics. Additionally, the jointly-trained Gaussian Splatting Decoder effectively reconstructs dynamic scenes, providing valuable geometric information for comprehensive scene understanding. Our project page is https://sensetime-fvg.github.io/CVD-STORM.

Paper Structure

This paper contains 23 sections, 4 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Early-Stage Generation Visualization. (a) shows the ground-truth sequence. (b) depicts the model’s output at training step 1,200 when using a standard VAE. (c) presents the corresponding output generated with our STORM-VAE at the same step. Notably, (c) exhibits significantly improved convergence and visual fidelity compared to (b), demonstrating the effectiveness of our approach even at an early stage of training.
  • Figure 2: Overall framework of the model. Our pipeline contains two models. The upper section illustrates STORM-VAE training, with the forward process indicated by blue arrows. STORM-VAE takes multi-view images from context timesteps and processes the image latents through two decoders: the VAE Decoder performs image reconstruction (updated by $\mathcal{L}_{\text{VAE}}$), while the GS Decoder performs scene reconstruction (updated by $\mathcal{L}_{\text{STORM}}$). The lower section illustrates the inference pipeline of CVD-STORM, with the forward process shown by solid block arrows. The diffusion part can either use STORM-VAE latents as reference frames for prediction or generate from noise, while incorporating various conditioning inputs for guidance.
  • Figure 3: Qualitative Results of Depth Estimation. This figure illustrates the depth of the videos generated by CVD-STORM at frames 0, 5, and 10. Our GS decoder successfully extracts the depth information of both dynamic and static objects.
  • Figure 4: Qualitative Results of Video Prediction. We produce this example using three reference frames. The first row shows the first reference frame and the following rows show the predicted frames. Our method demonstrates strong temporal consistency in the video prediction task.
  • Figure 5: Qualitative Results of Video Generation. We provide examples generated from the conditions only, without any reference frame. For each scene, we show the 1st frame in the first row and the 10th frame in the second row. The bounding boxes and road maps are overlaid on the generated images. Objects in bounding boxes of the same color should belong to the same class. For example, cars should be generated in the blue 3D bounding boxes.
  • ...and 7 more figures