HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting

Hongyu Zhou; Jiahao Shao; Lu Xu; Dongfeng Bai; Weichao Qiu; Bingbing Liu; Yue Wang; Andreas Geiger; Yiyi Liao

TL;DR

HUGS presents an RGB-only framework for holistic 3D understanding of urban scenes by extending Gaussian Splatting to jointly model static regions and multiple dynamic objects through 3D Gaussians. Dynamic object motion is constrained by a unicycle model, and Gaussians carry appearance, semantic logits, and flow, enabling simultaneous rendering of RGB images, semantic maps, and optical flow with exposure adaptation. The method achieves state-of-the-art results on novel-view synthesis, 3D semantic reconstruction, and scene editing across KITTI, KITTI-360, and Virtual KITTI 2, while running in real time (≈93 fps) and requiring only noisy 2D/3D cues as supervision. By integrating 3D semantic normalization and multi-modal losses, HUGS reduces reliance on ground-truth 3D boxes and LiDAR, offering a robust RGB-only path to dynamic urban scene understanding with practical applications in simulation, perception, and editing.
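To make the unicycle constraint mentioned above more concrete, here is a minimal sketch of a textbook planar unicycle model, in which an object moves with forward speed `v` and turning rate `omega`. The function names, the state layout `(x, y, theta)`, and the time step `dt` are assumptions made for illustration; the paper's exact parameterization and how it is coupled to the noisy 3D box predictions are not given in this summary.

```python
import math


def unicycle_step(x, y, theta, v, omega, dt):
    """Advance a planar unicycle state by one time step.

    Hypothetical helper for illustration: (x, y) is the ground-plane
    position, theta the heading in radians, v the forward speed and
    omega the yaw rate, both assumed constant over dt.
    """
    if abs(omega) < 1e-8:
        # Straight-line limit, avoids dividing by a vanishing yaw rate.
        return x + v * math.cos(theta) * dt, y + v * math.sin(theta) * dt, theta
    theta_next = theta + omega * dt
    # Closed-form integration along a circular arc of radius v / omega.
    x_next = x + (v / omega) * (math.sin(theta_next) - math.sin(theta))
    y_next = y - (v / omega) * (math.cos(theta_next) - math.cos(theta))
    return x_next, y_next, theta_next


def rollout(state, v, omega, dt, steps):
    """Return the list of states visited over `steps` steps, start included."""
    trajectory = [state]
    for _ in range(steps):
        state = unicycle_step(*state, v, omega, dt)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    # A car starting at the origin, heading along +x, turning gently left.
    for x, y, theta in rollout((0.0, 0.0, 0.0), v=5.0, omega=0.2, dt=0.1, steps=5):
        print(f"x={x:.3f} y={y:.3f} theta={theta:.3f}")
```

Because every pose in such a trajectory is tied to its neighbours through `v` and `omega`, a model of this kind can smooth out frame-to-frame jitter in independently predicted object poses, which is the regularizing role the summary attributes to it.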

Abstract

Holistic understanding of urban scenes based on RGB images is a challenging yet important problem. It encompasses understanding both the geometry and appearance to enable novel view synthesis, parsing semantic labels, and tracking moving objects. Despite considerable progress, existing approaches often focus on specific aspects of this task and require additional inputs such as LiDAR scans or manually annotated 3D bounding boxes. In this paper, we introduce a novel pipeline that utilizes 3D Gaussian Splatting for holistic urban scene understanding. Our main idea involves the joint optimization of geometry, appearance, semantics, and motion using a combination of static and dynamic 3D Gaussians, where moving object poses are regularized via physical constraints. Our approach offers the ability to render new viewpoints in real time, yield 2D and 3D semantic information with high accuracy, and reconstruct dynamic scenes, even in scenarios where 3D bounding box detections are highly noisy. Experimental results on KITTI, KITTI-360, and Virtual KITTI 2 demonstrate the effectiveness of our approach.

Paper Structure (24 sections, 21 equations, 18 figures, 10 tables)

Introduction
Related Work
Method
Decomposed Scene Representation
Holistic Urban Gaussian Splatting
Loss Functions
Implementation Details
Experiments
Novel View Synthesis
Semantic and Geometric Scene Understanding
Scene Editing
Ablation Study
Conclusion
Implementation
3D Gaussian Details
...and 9 more sections

Figures (18)

Figure 1: Illustration. Given posed RGB images as input, our method lifts noisy 2D & 3D predictions to the 3D space via decomposed 3D Gaussians, and enables holistic scene understanding in 2D and 3D space.
Figure 2: Method Overview. We decompose the scene into static regions and $N$ rigidly moving dynamic objects. Each dynamic object is represented using 3D Gaussians in its canonical space and then transformed to the world coordinates based on transformations constrained by a unicycle model. We use $N$ unicycle models of different parameters to individually represent the motion of $N$ dynamic objects. Each 3D Gaussian encompasses information about appearance and semantics, whereas the optical flow can be obtained by calculating the Gaussian center's motion, enabling the rendering of RGB images, semantic maps, and optical flow within a unified model. Our method is supervised using RGB images, noisy 2D semantic labels, and noisy optical flow, denoted as $\mathcal{L}_{\mathbf{I}}$, $\mathcal{L}_{\mathbf{S}}$, and $\mathcal{L}_{\mathbf{F}}$, respectively.
Figure 3: 3D Semantic Reconstruction. Comparison between applying softmax to accumulated 2D semantic logits (left) and to 3D semantic logits (right). Normalizing semantic logits in 3D space clearly reduces floaters and yields better 3D semantic reconstruction than the 2D normalization counterpart.
Figure 4: Qualitative Comparison on KITTI and vKITTI. We use monocular-based 3D bounding box predictions for KITTI, and manually jittered 3D bounding boxes for vKITTI. We zoom in on a patch of a dynamic object for each KITTI scene.
Figure 5: Detailed Qualitative Comparison with MARS on the KITTI-360 Leaderboard.
...and 13 more figures
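The distinction drawn in the Figure 3 caption, softmax applied to accumulated 2D logits versus softmax applied to per-Gaussian 3D logits, can be illustrated on a single camera ray. The sketch below assumes standard front-to-back alpha compositing and uses made-up opacities and logits; the function names are hypothetical and this is not the authors' renderer.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def blend_weights(alphas):
    """Front-to-back compositing weights w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    alphas = np.asarray(alphas, dtype=float)
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    return alphas * transmittance


def semantics_2d_norm(logits, alphas):
    """Accumulate raw logits along the ray, then normalize once in 2D."""
    return softmax(blend_weights(alphas) @ np.asarray(logits, dtype=float))


def semantics_3d_norm(logits, alphas):
    """Normalize each Gaussian's logits in 3D, then accumulate probabilities."""
    return blend_weights(alphas) @ softmax(np.asarray(logits, dtype=float), axis=-1)


if __name__ == "__main__":
    # Illustrative values only: a faint floater with a very large logit
    # sitting in front of a nearly opaque surface of another class.
    alphas = [0.05, 0.9]
    logits = [[100.0, 0.0],   # floater
              [0.0, 2.0]]     # surface
    print("2D normalization:", semantics_2d_norm(logits, alphas))
    print("3D normalization:", semantics_3d_norm(logits, alphas))
```

In this toy setting, normalizing in 3D bounds each Gaussian's contribution by its blending weight, so a low-opacity Gaussian cannot dominate the pixel by carrying an unbounded logit. That is one plausible reading of why the caption reports fewer floaters, not a claim about the authors' analysis. Note also that the 3D-normalized output sums to the accumulated opacity along the ray rather than to one.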
