DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition and Surface Reconstruction for Urban Driving Scenes

Chensheng Peng, Chengwei Zhang, Yixiao Wang, Chenfeng Xu, Yichen Xie, Wenzhao Zheng, Kurt Keutzer, Masayoshi Tomizuka, Wei Zhan

TL;DR

DeSiRe-GS presents a self-supervised 4D Gaussian Splatting framework that achieves static-dynamic decomposition and high-fidelity surface reconstruction in urban driving scenes without 3D bounding boxes. It introduces a two-stage pipeline: Stage I learns 2D motion masks from render-vs-ground-truth differences using a frozen foundation model, and Stage II distills these masks into PVG-based time-varying Gaussians with velocity regularization and geometric constraints. The geometric constraints comprise normal derivation from Gaussian scales, giant-Gaussian penalties, and temporal cross-view consistency, which together produce physically plausible surfaces. Across Waymo and KITTI, DeSiRe-GS delivers state-of-the-art rendering performance and competitive depth accuracy at near-real-time speeds, and it remains robust to data sparsity and dynamic objects in driving scenes.

Abstract

We present DeSiRe-GS, a self-supervised Gaussian splatting representation that enables effective static-dynamic decomposition and high-fidelity surface reconstruction in complex driving scenarios. Our approach employs a two-stage optimization pipeline of dynamic street Gaussians. In the first stage, we extract 2D motion masks based on the observation that 3D Gaussian Splatting inherently can reconstruct only the static regions in dynamic environments. In the second stage, these extracted 2D motion priors are mapped into the Gaussian space in a differentiable manner, leveraging an efficient formulation of dynamic Gaussians. Combined with the introduced geometric regularizations, our method is able to address the over-fitting issues caused by data sparsity in autonomous driving, reconstructing physically plausible Gaussians that align with object surfaces rather than floating in the air. Furthermore, we introduce temporal cross-view consistency to ensure coherence across time and viewpoints, resulting in high-quality surface reconstruction. Comprehensive experiments demonstrate the efficiency and effectiveness of DeSiRe-GS, which surpasses prior self-supervised methods and achieves accuracy comparable to methods relying on external 3D bounding box annotations. Code is available at https://github.com/chengweialan/DeSiRe-GS
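
The first-stage observation and the scale-based normal can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the motion cue is a thresholded per-pixel color residual (the summary above mentions features from a frozen foundation model, which raw colors only stand in for here), and that a Gaussian's normal is taken along its shortest scale axis. The function names `residual_motion_mask` and `normal_from_scales`, the `threshold` parameter, and the array layouts are all hypothetical.

```python
import numpy as np


def residual_motion_mask(rendered, ground_truth, threshold=0.5):
    """Flag pixels where a static-only render disagrees with the observation.

    rendered, ground_truth: float arrays of shape (H, W, C) with values in [0, 1].
    Returns a boolean (H, W) mask; True marks a likely dynamic pixel.
    Assumption: raw color residuals stand in for learned feature differences.
    """
    # Mean absolute difference over the channel axis gives one residual per pixel.
    residual = np.abs(rendered - ground_truth).mean(axis=-1)
    return residual > threshold


def normal_from_scales(scales, rotations):
    """Pick each Gaussian's shortest principal axis as its normal.

    scales: (N, 3) positive per-axis scales.
    rotations: (N, 3, 3) rotation matrices whose columns are the principal axes.
    Returns (N, 3) unit vectors.
    Assumption: the normal is the axis with the smallest scale.
    """
    shortest = np.argmin(scales, axis=1)
    # Select column shortest[i] of rotations[i] for every Gaussian i.
    normals = rotations[np.arange(scales.shape[0]), :, shortest]
    return normals / np.linalg.norm(normals, axis=1, keepdims=True)


if __name__ == "__main__":
    rendered = np.zeros((2, 2, 3))
    observed = np.zeros((2, 2, 3))
    observed[0, 1] = 1.0  # one pixel the static render failed to explain
    print(residual_motion_mask(rendered, observed))

    scales = np.array([[1.0, 0.1, 2.0]])
    rotations = np.eye(3)[None]
    print(normal_from_scales(scales, rotations))
```

A hard threshold keeps the sketch short. The abstract describes the second-stage mapping as differentiable, so an actual pipeline would need a soft mask rather than this boolean one.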

Paper Structure

This paper contains 25 sections, 26 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: DeSiRe-GS. We present a 4D street Gaussian splatting representation for self-supervised static-dynamic decomposition and high-fidelity surface reconstruction without the requirement for extra 3D annotations such as bounding boxes.
  • Figure 2: Pipeline of DeSiRe-GS. To tackle the challenges in self-supervised street scene decomposition, the entire pipeline is optimized in a self-supervised manner without extra annotations, leading to superior scene decomposition ability and rendering quality.
  • Figure 3: Gaussian Scale Regularization.
  • Figure 4: Cross-view consistency
  • Figure 5: Qualitative comparison with self-supervised S3Gaussian [s3g_2024_arxiv] and PVG [pvg_2023_arxiv]
  • ...and 6 more figures