UrbanGS: Semantic-Guided Gaussian Splatting for Urban Scene Reconstruction
Ziwen Li, Jiaxin Huang, Runnan Chen, Yunlong Che, Yandong Guo, Tongliang Liu, Fakhri Karray, Mingming Gong
TL;DR
UrbanGS reconstructs urban scenes that contain both static structures and potentially dynamic elements without relying on manual 3D annotations. It uses 2D semantic maps to separate definite static Gaussians from potentially dynamic ones, enforces static invariance via $\mathcal{L}_{static}$, and applies a KNN-based ground regularization to low-textured regions. Dynamic regions are modeled with a 4D Gaussian Splatting framework that uses learnable time embeddings, with a lightweight MLP predicting time-conditioned residuals. The whole model is optimized end to end, combining appearance, depth, and semantic supervision with the static and ground constraints. On nuScenes and PandaSet, the method reports state-of-the-art reconstruction quality, robustness to predicted bounding boxes, and preservation of static content alongside accurate dynamic object modeling.
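To make the static-invariance idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that $\mathcal{L}_{static}$ penalizes the magnitude of the time-conditioned residual predicted for Gaussians flagged as definite static; the exact form is not stated in this summary. The function names `apply_residuals` and `static_invariance_loss`, and the residual values, are hypothetical.

```python
def apply_residuals(means, residuals):
    """Add a per-Gaussian residual (dx, dy, dz) to each Gaussian center."""
    return [[m + r for m, r in zip(mean, res)] for mean, res in zip(means, residuals)]


def static_invariance_loss(residuals, is_static):
    """Mean squared residual norm over Gaussians flagged as definite static.

    Assumed form of L_static, for illustration only.
    """
    picked = [res for res, flag in zip(residuals, is_static) if flag]
    if not picked:
        return 0.0
    return sum(sum(c * c for c in res) for res in picked) / len(picked)


# Two Gaussians: the first is definite static, the second potentially dynamic.
means = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
is_static = [True, False]
# Hypothetical residuals that the MLP might predict at one timestamp.
residuals = [[0.3, 0.4, 0.0], [1.0, 0.0, 0.0]]

moved = apply_residuals(means, residuals)
loss = static_invariance_loss(residuals, is_static)
```

Under this assumed form, only the static Gaussian contributes to the loss, so minimizing it pushes the residuals of static Gaussians toward zero while leaving potentially dynamic Gaussians free to move.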
Abstract
Reconstructing urban scenes is challenging due to their complex geometries and the presence of potentially dynamic objects. 3D Gaussian Splatting (3DGS)-based methods have shown strong performance, but existing approaches often rely on manual 3D annotations to improve dynamic object modeling, which is impractical because of high labeling costs. Some methods use 4D Gaussian Splatting (4DGS) to represent the entire scene, but they treat static and dynamic objects uniformly, which causes unnecessary updates to static elements and ultimately degrades reconstruction quality. To address these issues, we propose UrbanGS, which combines 2D semantic maps with an existing dynamic Gaussian approach to separate static objects from the rest of the scene, so that definite static and potentially dynamic elements can be processed separately. Specifically, for definite static regions, we enforce global consistency to prevent the dynamic Gaussian representation from introducing unintended changes, and we introduce a K-nearest neighbor (KNN)-based regularization to improve local coherence on low-textured ground surfaces. For potentially dynamic objects, we aggregate temporal information using learnable time embeddings, allowing each Gaussian to model deformations over time. Extensive experiments on real-world datasets demonstrate that our approach outperforms state-of-the-art methods in reconstruction quality and efficiency, accurately preserving static content while capturing dynamic elements.
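The KNN-based ground regularization can be illustrated with a short sketch. This is one plausible form under stated assumptions, not the formulation from the paper: each ground Gaussian center is compared with its K nearest ground neighbors in the horizontal plane, and deviations of its height from the neighbors' mean height are penalized. The function name `knn_ground_loss` and the choice of height as the regularized quantity are hypothetical.

```python
def knn_ground_loss(points, k=2):
    """Penalize height deviations between each ground point and its K neighbors.

    points: list of (x, y, z) centers of Gaussians labeled as ground.
    Neighbors are found by brute force in the (x, y) plane, which is fine
    for a small illustration but too slow for a full scene.
    """
    if len(points) <= 1:
        return 0.0
    total = 0.0
    for i, (xi, yi, zi) in enumerate(points):
        # Sort the other ground points by squared horizontal distance.
        others = sorted(
            (p for j, p in enumerate(points) if j != i),
            key=lambda p: (p[0] - xi) ** 2 + (p[1] - yi) ** 2,
        )
        neighbors = others[:k]
        mean_z = sum(p[2] for p in neighbors) / len(neighbors)
        total += (zi - mean_z) ** 2
    return total / len(points)


# A flat patch of ground versus the same patch with one raised corner.
flat = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
bumpy = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 1.0)]

flat_loss = knn_ground_loss(flat)
bumpy_loss = knn_ground_loss(bumpy)
```

The flat patch incurs no penalty, while the raised corner produces a positive loss, which is the kind of local coherence signal that can help where photometric supervision is weak on low-textured ground.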
