WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving

Ziyue Zhu, Zhanqian Wu, Zhenxin Zhu, Lijun Zhou, Haiyang Sun, Bing Wan, Kun Ma, Guang Chen, Hangjun Ye, Jin Xie, Jian Yang

TL;DR

WorldSplat presents a feed-forward 4D driving-scene generator that jointly learns a 4D-aware latent diffusion model and a latent 4D Gaussians decoder to produce pixel-aligned 3D Gaussians, followed by an enhanced diffusion refinement for high-fidelity novel-view videos. By embedding multi-modal cues (RGB, depth, semantics) and explicit static-dynamic decomposition, the method achieves temporally and spatially consistent cross-view synthesis without per-scene optimization. Extensive nuScenes experiments show state-of-the-art performance in both original-view video generation and novel-view synthesis, with demonstrated downstream gains in perception tasks when using generated data. The framework enables controllable, high-quality 4D driving scene generation suitable for training, evaluation, and scenario simulation in autonomous driving pipelines.

Abstract

Recent advances in driving-scene generation and reconstruction have demonstrated significant potential for enhancing autonomous driving systems by producing scalable and controllable training data. Existing generation methods primarily focus on synthesizing diverse and high-fidelity driving videos; however, due to limited 3D consistency and sparse viewpoint coverage, they struggle to support convenient and high-quality novel-view synthesis (NVS). Conversely, recent 3D/4D reconstruction approaches have significantly improved NVS for real-world driving scenes, yet inherently lack generative capabilities. To overcome this dilemma between scene generation and reconstruction, we propose WorldSplat, a novel feed-forward framework for 4D driving-scene generation. Our approach effectively generates consistent multi-track videos through two key steps: (i) We introduce a 4D-aware latent diffusion model integrating multi-modal information to produce pixel-aligned 4D Gaussians in a feed-forward manner. (ii) Subsequently, we refine the novel-view videos rendered from these Gaussians using an enhanced video diffusion model. Extensive experiments conducted on benchmark datasets demonstrate that WorldSplat effectively generates high-fidelity, temporally and spatially consistent multi-track novel-view driving videos. Project: https://wm-research.github.io/worldsplat/
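
To make "pixel-aligned" concrete, the sketch below shows one common way to place a Gaussian center for every pixel: back-project the pixel along its camera ray by a per-pixel depth value, then move the point into the world frame. This is a generic illustration under stated assumptions, not the paper's decoder. The function name `unproject_pixels`, the pinhole intrinsics `K`, the z-depth convention, and the `cam_to_world` pose are all hypothetical choices made here. A full 3D Gaussian also carries scale, rotation, opacity, and color, which are omitted.

```python
import numpy as np

def unproject_pixels(depth, K, cam_to_world):
    """Return an (H*W, 3) array of world-space points, one per pixel.

    depth        : (H, W) z-depth per pixel (assumed convention)
    K            : (3, 3) pinhole intrinsics (assumed camera model)
    cam_to_world : (4, 4) rigid camera-to-world transform
    """
    H, W = depth.shape
    # Pixel-centre coordinates: u runs along the width, v along the height.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Back-project through the inverse intrinsics, then scale by depth.
    rays = pixels @ np.linalg.inv(K).T
    points_cam = rays * depth.reshape(-1, 1)
    # Apply the rigid transform to reach the world frame.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return points_cam @ R.T + t
```

Because every output row corresponds to exactly one pixel, per-pixel predictions (for example color or a static/dynamic label) can be attached to the resulting points by index.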

Paper Structure

This paper contains 26 sections, 6 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Comparison of different driving world models. Previous driving world models focus on video generation, while our method directly creates controllable 4D Gaussians in a feed-forward manner, enabling the production of novel-view videos (e.g. shifting the ego trajectory by $\pm N$ m) with spatiotemporal consistency.
  • Figure 2: The overview of our framework. (1) Employing a 4D-aware diffusion model to generate a multi-modal latent containing RGB, depth, and dynamic information. (2) Predicting pixel-aligned 3D Gaussians from the denoised latent using our feed-forward latent decoder. (3) Aggregating the 3D Gaussians with dynamic-static decomposition to form 4D Gaussians and rendering novel-view videos. (4) Improving the spatial resolution and temporal consistency of the rendered videos with an enhanced diffusion model. The two $\uparrow$ arrow styles in the figure denote the train-only and inference paths, respectively.
  • Figure 3: Effectiveness of the enhanced diffusion model. During novel-view video synthesis, rendering quality may degrade due to unobserved regions or high ego-vehicle speed, resulting in missing content and artifacts. Our enhanced diffusion model can inpaint unobserved areas and sharpen fast-motion frames.
  • Figure 4: Comparison with MagicDrive [gao2023magicdrive] and Panacea [wen2024panacea]. The top row shows real frames, and the second row shows the corresponding sketches and bounding-box controls. Red boxes highlight areas where our method achieves the most notable improvements.
  • Figure 5: Qualitative comparison of our novel-view synthesis against the state-of-the-art urban reconstruction method [chen2024omnire]. We translate the ego-vehicle by $\pm2\,$m to generate the novel viewpoints. Red boxes indicate where our method achieves the greatest improvements.
  • ...and 8 more figures
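
The novel viewpoints mentioned in the captions for Figures 1 and 5 come from shifting the ego trajectory sideways (for example by $\pm2$ m). A minimal sketch of such a shift is given below. It assumes each pose is a 4x4 ego-to-world matrix and that the ego frame's y-axis points to the left of the vehicle, as in the nuScenes convention. The function name `shift_trajectory` and its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def shift_trajectory(ego_to_world, lateral_offset_m):
    """Translate each 4x4 ego pose sideways by a fixed offset in metres.

    ego_to_world     : (T, 4, 4) sequence of ego-to-world poses
    lateral_offset_m : positive moves left, negative moves right
                       (assumes the ego y-axis points left)
    """
    poses = np.asarray(ego_to_world, dtype=float).copy()
    # Column 1 of each rotation is the ego y-axis expressed in world coordinates.
    left_axis = poses[:, :3, 1]
    # Move only the translation; orientations stay unchanged.
    poses[:, :3, 3] += lateral_offset_m * left_axis
    return poses
```

Calling the function once with `+2.0` and once with `-2.0` on the same pose sequence gives two laterally offset trajectories from which novel-view videos could be rendered.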