DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition and Surface Reconstruction for Urban Driving Scenes
Chensheng Peng, Chengwei Zhang, Yixiao Wang, Chenfeng Xu, Yichen Xie, Wenzhao Zheng, Kurt Keutzer, Masayoshi Tomizuka, Wei Zhan
TL;DR
DeSiRe-GS is a self-supervised 4D Gaussian Splatting framework that achieves static-dynamic decomposition and high-fidelity surface reconstruction in urban driving scenes without 3D bounding boxes. It uses a two-stage pipeline: Stage I learns 2D motion masks from render-vs-ground-truth differences using a frozen foundation model, and Stage II distills these masks into PVG-based time-varying Gaussians with velocity regularization and geometric constraints. To produce physically plausible surfaces, the method combines geometric regularization, normals derived from Gaussian scales, a penalty on giant Gaussians, and temporal cross-view consistency. On the Waymo and KITTI datasets, DeSiRe-GS delivers state-of-the-art rendering performance and competitive depth accuracy at near-real-time speed, and it remains robust to data sparsity and dynamic objects in driving scenes.
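To make the two geometric terms concrete, the following is a minimal sketch of one common way to derive a normal from Gaussian scales and to penalize oversized Gaussians. It is an illustration under stated assumptions, not the authors' implementation: the normal is taken as the Gaussian's shortest local axis rotated into the world frame, and the penalty is a simple hinge on the largest scale. The function names and the `max_extent` threshold are hypothetical.

```python
import math

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix (nested lists)."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def gaussian_normal(scales, quat):
    """Assumed rule: the normal is the shortest axis of the Gaussian.

    A flat Gaussian lies along a surface, so its thinnest direction
    approximates the surface normal. The sign is ambiguous; a real
    pipeline would typically flip it to face the camera.
    """
    k = min(range(3), key=lambda i: scales[i])  # index of smallest scale
    R = quat_to_rotmat(quat)
    return [R[0][k], R[1][k], R[2][k]]  # k-th column of R, in world frame

def giant_gaussian_penalty(scales, max_extent=1.0):
    """Hinge penalty on the largest scale; max_extent is a hypothetical threshold."""
    return max(0.0, max(scales) - max_extent)
```

For example, a Gaussian with scales `[1.0, 2.0, 0.01]` and an identity rotation is flat in the xy-plane, so `gaussian_normal` returns the z-axis. The hinge form leaves Gaussians below the threshold unpenalized.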
Abstract
We present DeSiRe-GS, a self-supervised Gaussian splatting representation that enables effective static-dynamic decomposition and high-fidelity surface reconstruction in complex driving scenarios. Our approach employs a two-stage optimization pipeline for dynamic street Gaussians. In the first stage, we extract 2D motion masks based on the observation that 3D Gaussian Splatting can inherently reconstruct only the static regions of dynamic environments. In the second stage, these extracted 2D motion priors are mapped into the Gaussian space in a differentiable manner, leveraging an efficient formulation of dynamic Gaussians. Combined with the introduced geometric regularizations, our method is able to address the over-fitting issues caused by data sparsity in autonomous driving, reconstructing physically plausible Gaussians that align with object surfaces rather than floating in the air. Furthermore, we introduce temporal cross-view consistency to ensure coherence across time and viewpoints, resulting in high-quality surface reconstruction. Comprehensive experiments demonstrate the efficiency and effectiveness of DeSiRe-GS, which surpasses prior self-supervised methods and achieves accuracy comparable to methods that rely on external 3D bounding box annotations. Code is available at https://github.com/chengweialan/DeSiRe-GS
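The first-stage observation, that a static reconstruction fails precisely where the scene moves, can be illustrated with a toy example. The sketch below is a deliberate simplification and not the method described above: it thresholds the raw per-pixel color residual between a rendered image and the ground-truth image, whereas the actual pipeline learns the masks. The function name `motion_mask`, the image layout (nested lists of RGB tuples in [0, 1]), and the `threshold` value are all assumptions made for illustration.

```python
def motion_mask(rendered, target, threshold=0.2):
    """Return a binary mask marking pixels the static render fails to explain.

    rendered, target: images as lists of rows, each row a list of
    (r, g, b) tuples with values in [0, 1].
    A pixel is flagged (1) when its mean absolute channel difference
    exceeds the (hypothetical) threshold, otherwise 0.
    """
    mask = []
    for row_r, row_t in zip(rendered, target):
        mask_row = []
        for pix_r, pix_t in zip(row_r, row_t):
            residual = sum(abs(a - b) for a, b in zip(pix_r, pix_t)) / len(pix_r)
            mask_row.append(1 if residual > threshold else 0)
        mask.append(mask_row)
    return mask


# Toy 1x2 image: the static render matches the first pixel but misses
# a moving red object that covers the second pixel in the ground truth.
rendered = [[(0.5, 0.5, 0.5), (0.5, 0.5, 0.5)]]
target = [[(0.5, 0.5, 0.5), (1.0, 0.0, 0.0)]]
mask = motion_mask(rendered, target)
```

In this toy case only the second pixel is flagged. A raw photometric residual like this is sensitive to lighting and exposure changes, which is one reason to prefer a learned mask over a fixed threshold.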
