HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting
Hongyu Zhou, Jiahao Shao, Lu Xu, Dongfeng Bai, Weichao Qiu, Bingbing Liu, Yue Wang, Andreas Geiger, Yiyi Liao
TL;DR
HUGS presents an RGB-only framework for holistic 3D understanding of urban scenes. It extends Gaussian Splatting to jointly model static regions and multiple dynamic objects with 3D Gaussians. Dynamic object motion is constrained by a unicycle model, and each Gaussian carries appearance, semantic logits, and flow, so the method can render RGB images, semantic maps, and optical flow simultaneously, with exposure adaptation. The method achieves state-of-the-art results on novel-view synthesis, 3D semantic reconstruction, and scene editing across KITTI, KITTI-360, and Virtual KITTI 2, while running in real time (≈93 fps) and requiring only noisy 2D/3D cues as supervision. By integrating 3D semantic normalization and multi-modal losses, HUGS reduces reliance on ground-truth 3D boxes and LiDAR, offering a robust path to dynamic urban scene understanding with practical applications in simulation, perception, and editing.
Abstract
Holistic understanding of urban scenes based on RGB images is a challenging yet important problem. It encompasses understanding both the geometry and appearance to enable novel view synthesis, parsing semantic labels, and tracking moving objects. Despite considerable progress, existing approaches often focus on specific aspects of this task and require additional inputs such as LiDAR scans or manually annotated 3D bounding boxes. In this paper, we introduce a novel pipeline that utilizes 3D Gaussian Splatting for holistic urban scene understanding. Our main idea involves the joint optimization of geometry, appearance, semantics, and motion using a combination of static and dynamic 3D Gaussians, where moving object poses are regularized via physical constraints. Our approach offers the ability to render new viewpoints in real time, yield 2D and 3D semantic information with high accuracy, and reconstruct dynamic scenes, even in scenarios where 3D bounding box detections are highly noisy. Experimental results on KITTI, KITTI-360, and Virtual KITTI 2 demonstrate the effectiveness of our approach.
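
To make the physical constraint on moving objects more concrete, the following is a minimal sketch of a generic unicycle motion model, the kind of model the TL;DR names as the regularizer for object poses. It is not the authors' implementation: the function names `unicycle_step` and `rollout`, the state layout `(x, y, theta)`, and the use of the closed-form constant-velocity update are all assumptions made for illustration. How HUGS parameterizes the model and couples it to noisy 3D box detections is not specified in this section.

```python
import math


def unicycle_step(x, y, theta, v, omega, dt):
    """Advance a planar unicycle state by dt.

    Hypothetical helper: assumes forward speed v and yaw rate omega
    are constant over the interval, and uses the closed-form solution of
        dx/dt = v * cos(theta), dy/dt = v * sin(theta), dtheta/dt = omega.
    """
    if abs(omega) < 1e-9:
        # Near-zero yaw rate: the object moves along a straight line.
        return (x + v * math.cos(theta) * dt,
                y + v * math.sin(theta) * dt,
                theta)
    theta_next = theta + omega * dt
    # Otherwise the object moves along a circular arc of radius v / omega.
    x_next = x + (v / omega) * (math.sin(theta_next) - math.sin(theta))
    y_next = y - (v / omega) * (math.cos(theta_next) - math.cos(theta))
    return (x_next, y_next, theta_next)


def rollout(state, controls, dt):
    """Integrate a sequence of (v, omega) controls from an initial state.

    Returns the list of states, including the initial one.
    """
    trajectory = [state]
    for v, omega in controls:
        state = unicycle_step(*state, v, omega, dt)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    # Illustrative values only: a gentle left turn at 5 m/s, sampled at 10 Hz.
    for x, y, theta in rollout((0.0, 0.0, 0.0), [(5.0, 0.2)] * 5, 0.1):
        print(f"x={x:.3f} y={y:.3f} theta={theta:.3f}")
```

Because such a model only admits motion along the heading direction, a trajectory generated this way cannot slide sideways or jump between frames. This is one plausible reason a unicycle prior helps when per-frame box detections are noisy.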
