UrbanGS: Semantic-Guided Gaussian Splatting for Urban Scene Reconstruction

Ziwen Li, Jiaxin Huang, Runnan Chen, Yunlong Che, Yandong Guo, Tongliang Liu, Fakhri Karray, Mingming Gong

TL;DR

UrbanGS addresses the challenge of reconstructing urban scenes containing both static structures and potentially dynamic elements without manual 3D annotations. It accomplishes this with a semantic-guided decomposition that separates definite static Gaussians from potentially dynamic ones using 2D semantic maps, enforcing static invariance via $\mathcal{L}_{static}$ and applying a KNN-based ground regularization for low-textured regions, while modeling dynamic regions with a 4D Gaussian Splatting framework that uses learnable time embeddings. The method integrates a lightweight MLP to predict time-conditioned residuals, yielding end-to-end optimization that fuses appearance, depth, and semantic supervision with static/ground constraints. Empirical results on nuScenes and PandaSet show state-of-the-art reconstruction quality, strong robustness to predicted bounding boxes, and clear preservation of static content alongside accurate dynamic object modeling, highlighting the practical impact for scalable urban scene understanding and rendering.
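The decomposition and the static invariance term described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the class ids, the two-layer MLP, the embedding size, and the L1 form used for $\mathcal{L}_{static}$ are all hypothetical choices, since this summary does not give the exact formulas.

```python
import numpy as np

# Hypothetical class ids; the source names the classes but not their ids.
STATIC_CLASSES = {0, 1, 2}    # e.g. building, vegetation, road
DYNAMIC_CLASSES = {3, 4, 5}   # e.g. vehicle, pedestrian, cyclist


def split_by_semantics(labels):
    """Boolean mask that is True for 'definite static' Gaussians."""
    return np.isin(labels, list(STATIC_CLASSES))


def predict_residuals(positions, time_embedding, w1, w2):
    """Tiny two-layer MLP mapping (position, time embedding) to a 3D offset."""
    n = positions.shape[0]
    feats = np.concatenate([positions, np.tile(time_embedding, (n, 1))], axis=1)
    hidden = np.maximum(feats @ w1, 0.0)  # ReLU
    return hidden @ w2


def static_invariance_loss(residuals, static_mask):
    """Mean absolute residual on static Gaussians (assumed form of the loss)."""
    if not static_mask.any():
        return 0.0
    return float(np.abs(residuals[static_mask]).mean())


rng = np.random.default_rng(0)
labels = np.array([0, 2, 3, 4, 1])        # per-Gaussian semantic label
positions = rng.normal(size=(5, 3))       # Gaussian centers
time_embedding = rng.normal(size=4)       # stands in for one learnable vector per timestamp
w1 = rng.normal(size=(7, 16)) * 0.1       # 3 position dims + 4 embedding dims
w2 = rng.normal(size=(16, 3)) * 0.1

static_mask = split_by_semantics(labels)
residuals = predict_residuals(positions, time_embedding, w1, w2)
loss = static_invariance_loss(residuals, static_mask)

# Only potentially dynamic Gaussians are moved; static ones stay in place.
deformed = positions + np.where(static_mask[:, None], 0.0, residuals)
```

In a real training loop the weights and the time embedding would be optimized by gradient descent together with the rendering losses; here they are random so that the sketch stays self-contained.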

Abstract

Reconstructing urban scenes is challenging due to their complex geometries and the presence of potentially dynamic objects. 3D Gaussian Splatting (3DGS)-based methods have shown strong performance, but existing approaches often incorporate manual 3D annotations to improve dynamic object modeling, which is impractical due to high labeling costs. Some methods leverage 4D Gaussian Splatting (4DGS) to represent the entire scene, but they treat static and dynamic objects uniformly, leading to unnecessary updates for static elements and ultimately degrading reconstruction quality. To address these issues, we propose UrbanGS, which leverages 2D semantic maps and an existing dynamic Gaussian approach to distinguish static objects from the rest of the scene, enabling separate processing of definite static and potentially dynamic elements. Specifically, for definite static regions, we enforce global consistency to prevent unintended changes in the dynamic Gaussians and introduce a K-nearest neighbor (KNN)-based regularization to improve local coherence on low-textured ground surfaces. Notably, for potentially dynamic objects, we aggregate temporal information using learnable time embeddings, allowing each Gaussian to model deformations over time. Extensive experiments on real-world datasets demonstrate that our approach outperforms state-of-the-art methods in reconstruction quality and efficiency, accurately preserving static content while capturing dynamic elements.
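To make the KNN-based ground regularization more concrete, the following NumPy sketch penalizes each ground Gaussian for deviating in height from its K nearest neighbors. It rests on several assumptions that are not stated in the abstract: the z axis points up, neighbors are searched in the xy plane with a brute-force distance matrix, and the penalty is a mean squared height difference. The function names are hypothetical.

```python
import numpy as np


def knn_indices(points_xy, k):
    """Indices of the k nearest neighbors of each point (brute force)."""
    diff = points_xy[:, None, :] - points_xy[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist2, np.inf)  # exclude the point itself
    return np.argsort(dist2, axis=1)[:, :k]


def knn_ground_regularizer(ground_points, k=4):
    """Mean squared gap between each height and its neighbors' mean height."""
    idx = knn_indices(ground_points[:, :2], k)
    neighbor_height = ground_points[idx, 2].mean(axis=1)
    return float(((ground_points[:, 2] - neighbor_height) ** 2).mean())


rng = np.random.default_rng(0)
xy = rng.uniform(-5.0, 5.0, size=(50, 2))
flat = np.column_stack([xy, np.zeros(50)])                       # ideal flat road
bumpy = np.column_stack([xy, rng.normal(scale=0.3, size=50)])    # noisy road

flat_loss = knn_ground_regularizer(flat, k=4)
bumpy_loss = knn_ground_regularizer(bumpy, k=4)
```

A perfectly flat surface incurs no penalty while a noisy one does, which is the kind of local coherence the regularization is meant to encourage on low-textured ground. For large scenes, a spatial index such as a k-d tree would replace the quadratic distance matrix.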

Paper Structure

This paper contains 14 sections, 16 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Qualitative comparison on the nuScenes [caesar2020nuscenes] dataset. While DeformGS [yang2024deformable] achieves comparable results on static regions, it fails on dynamic objects, producing severe artifacts and blurred reconstructions. In contrast, our UrbanGS maintains high fidelity for both dynamic objects and static backgrounds, also surpassing the reconstruction quality of PVG [chen2023periodic].
  • Figure 2: Semantic-guided decomposition over time. For each timestamp ($T_1$, $T_2$, $T_3$), semantic Gaussians of the current frame are obtained through rendering and supervision of corresponding semantic maps. Dynamic classes include vehicles, pedestrians, and cyclists, while the static set comprises buildings, vegetation, and roads. For simplicity, we use "Road" to represent ground surfaces.
  • Figure 3: Overview of the UrbanGS framework. Given input images with semantic information during training, Gaussians are classified into definite static and potentially dynamic elements through semantic-guided decomposition. For definite static Gaussians, we introduce a static invariance constraint to preserve their temporal invariance and prevent unintended transformations. To address challenges in low-texture regions (e.g., ground surfaces), a KNN-based regularization mechanism is employed to enforce structural coherence. Potentially dynamic objects are represented with 4D Gaussian Splatting, which captures motion patterns by incorporating a learnable time embedding, with deformations predicted at desired timestamps using an MLP.
  • Figure 4: Comparison of reconstruction quality across consecutive frames. DeformGS [yang2024deformable] struggles significantly with reconstructing dynamic objects, resulting in severe artifacts and a failure to accurately represent motion. PVG [chen2023periodic] captures dynamic vehicles to some extent but suffers from noticeable blurring, particularly in the lower parts of the objects. In contrast, UrbanGS delivers superior reconstruction quality, maintaining high fidelity and preserving clear details throughout the dynamic objects.
  • Figure 5: Ablation study on the effectiveness of static regularization. Results without static regularization (left) are blurry, while adding it (right) produces sharper details.