UniSplat: Unified Spatio-Temporal Fusion via 3D Latent Scaffolds for Dynamic Driving Scene Reconstruction

Chen Shi, Shaoshuai Shi, Xiaoyang Lyu, Chunyang Liu, Kehua Sheng, Bo Zhang, Li Jiang

TL;DR

UniSplat tackles dynamic driving scene reconstruction from sparse, minimally overlapping multi-camera views by introducing a unified 3D latent scaffold that integrates spatial and temporal information in an ego-centric frame. It combines geometry and semantic priors from foundation models, a sparse 3D fusion mechanism for both current and past frames, and a dual-branch Gaussian decoder that produces dynamic-aware primitives, while maintaining a memory bank of static content for long-term scene completion. The training objective jointly optimizes image reconstruction, perceptual quality, dynamic consistency, and scale alignment across input and novel views. Experiments on Waymo Open and nuScenes show state-of-the-art novel view synthesis and robust rendering beyond the original camera coverage, which suggests UniSplat’s potential for real-time driving applications and lifelong environment modeling.

Abstract

Feed-forward 3D reconstruction for autonomous driving has advanced rapidly, yet existing methods struggle with the joint challenges of sparse, non-overlapping camera views and complex scene dynamics. We present UniSplat, a general feed-forward framework that learns robust dynamic scene reconstruction through unified latent spatio-temporal fusion. UniSplat constructs a 3D latent scaffold, a structured representation that captures geometric and semantic scene context by leveraging pretrained foundation models. To effectively integrate information across spatial views and temporal frames, we introduce an efficient fusion mechanism that operates directly within the 3D scaffold, enabling consistent spatio-temporal alignment. To ensure complete and detailed reconstructions, we design a dual-branch decoder that generates dynamic-aware Gaussians from the fused scaffold by combining point-anchored refinement with voxel-based generation, and maintain a persistent memory of static Gaussians to enable streaming scene completion beyond current camera coverage. Extensive experiments on real-world datasets demonstrate that UniSplat achieves state-of-the-art performance in novel view synthesis, while providing robust and high-quality renderings even for viewpoints outside the original camera coverage.

Paper Structure

This paper contains 13 sections, 11 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Overview of UniSplat. Given multi-camera images from vehicle-mounted cameras, UniSplat leverages foundation models to construct geometry-semantic aware 3D latent scaffolds, where unified spatio-temporal fusion is performed. From this scaffold, a dual-branch decoder generates dynamic-aware Gaussian primitives using both point anchors and voxel centers, with dynamic filtering maintaining a persistent memory of static scene content. The red boxes highlight a dynamic car that is filtered out in our memory module (best viewed when zoomed in).
  • Figure 2: Qualitative comparisons on the Waymo dataset. Our method yields more detailed and consistent geometry than existing works. Red boxes indicate artifacts. Best viewed zoomed in.
  • Figure 3: Qualitative results of scene completion on the Waymo dataset. Top: Aggregated scene without dynamic filtering, where red boxes indicate ghosting artifacts caused by accumulating the dynamic car. Bottom: Our method, equipped with dynamic-aware Gaussians, completes regions left unobserved by limited sensor coverage and bridges cross-camera gaps while avoiding dynamic artifacts. The predicted dynamic masks used for filtering are shown for reference.
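
The dynamic filtering described in the captions of Figures 1 and 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the `Gaussian` record, the per-Gaussian `dyn_prob` score, the `dyn_threshold` value of 0.5, and the function names are all hypothetical, and only Gaussian centers are tracked (real primitives also carry scale, rotation, opacity, and color). The sketch assumes the memory is stored in the previous ego frame and that a rigid 4x4 transform into the current ego frame is available.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat4 = Sequence[Sequence[float]]  # row-major 4x4 rigid transform


@dataclass
class Gaussian:
    center: Vec3     # position in the ego frame it is currently expressed in
    dyn_prob: float  # hypothetical predicted probability of being dynamic


def transform_point(T: Mat4, p: Vec3) -> Vec3:
    """Apply a rigid 4x4 transform to a 3D point (homogeneous w = 1)."""
    x, y, z = p
    return tuple(
        T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3)
    )


def update_static_memory(
    memory: List[Gaussian],
    new_gaussians: List[Gaussian],
    prev_to_curr: Mat4,
    dyn_threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[Gaussian]:
    """Warp stored static Gaussians into the current ego frame, then append
    only those new Gaussians whose dynamic score is below the threshold."""
    warped = [
        Gaussian(transform_point(prev_to_curr, g.center), g.dyn_prob)
        for g in memory
    ]
    static_new = [g for g in new_gaussians if g.dyn_prob < dyn_threshold]
    return warped + static_new
```

Skipping the filter (appending every new Gaussian) corresponds to the top row of Figure 3, where a moving car is accumulated across frames and leaves ghosting artifacts in the aggregated scene.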