CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving
Tianrui Zhang, Yichen Liu, Zilin Guo, Yuxin Guo, Jingcheng Ni, Chenjing Ding, Dan Xu, Lewei Lu, Zehuan Wu
TL;DR
CVD-STORM tackles the challenge of learning driving world models that support long-term, multi-view video generation and accurate 4D reconstruction. It combines STORM-VAE, a VAE augmented with a Gaussian Splatting decoder, with a cross-view diffusion backbone to produce six-view videos conditioned on text, bounding boxes, and HD maps, while the Gaussian Splatting decoder reconstructs the 4D scene. The approach uses a single-stage training regime and a rectified flow loss, substantially improving FID and FVD on nuScenes and enabling coherent sequences of up to 20 seconds with depth-aware reconstruction. By integrating high-fidelity generation with explicit 4D geometry, the work yields both better visual realism and richer scene understanding for planning and simulation.
Abstract
Generative models have been widely applied to world modeling for environment simulation and future state prediction. With advancements in autonomous driving, there is a growing demand not only for high-fidelity video generation under various controls, but also for producing diverse and meaningful information such as depth estimation. To address this, we propose CVD-STORM, a cross-view video diffusion model utilizing a spatial-temporal reconstruction Variational Autoencoder (VAE) that generates long-term, multi-view videos with 4D reconstruction capabilities under various control inputs. Our approach first fine-tunes the VAE with an auxiliary 4D reconstruction task, enhancing its ability to encode 3D structures and temporal dynamics. Subsequently, we integrate this VAE into the video diffusion process to significantly improve generation quality. Experimental results demonstrate that our model achieves substantial improvements in both FID and FVD metrics. Additionally, the jointly trained Gaussian Splatting Decoder effectively reconstructs dynamic scenes, providing valuable geometric information for comprehensive scene understanding. Our project page is https://sensetime-fvg.github.io/CVD-STORM.
