Training-Free Hierarchical Scene Understanding for Gaussian Splatting with Superpoint Graphs
Shaohui Dai, Yansong Qu, Zheyan Li, Xinyang Li, Shengchuan Zhang, Liujuan Cao
TL;DR
This work tackles open-vocabulary 3D scene understanding by eliminating iterative per-view optimization and enforcing 3D semantic consistency. It introduces a training-free pipeline that constructs a multi-level semantic field on a superpoint graph derived from Gaussian primitives, leveraging SAM-guided contrastive partitioning and rendering-guided reprojection to fuse 2D semantic cues into 3D. The approach delivers state-of-the-art open-vocabulary segmentation while achieving over $30\times$ faster semantic-field reconstruction than prior methods, confirmed by experiments on LERF-OVS, 3DOVS, and ScanNet. The hierarchical design enables both coarse and fine-grained language-driven perception and supports interactive, parts-based scene editing in 3D, with strong practical implications for AR and robotics.
Abstract
Bridging natural language and 3D geometry is a crucial step toward flexible, language-driven scene understanding. Recent advances in 3D Gaussian Splatting (3DGS) have enabled fast, high-quality scene reconstruction, and subsequent research has explored incorporating open-vocabulary understanding into 3DGS. However, most existing methods require iterative optimization over per-view 2D semantic feature maps, which is inefficient and yields 3D semantics that are inconsistent across views. To address these limitations, we introduce a training-free framework that constructs a superpoint graph directly from Gaussian primitives. The superpoint graph partitions the scene into spatially compact and semantically coherent regions, forming view-consistent 3D entities and providing a structured foundation for open-vocabulary understanding. Based on the graph structure, we design an efficient reprojection strategy that lifts 2D semantic features onto the superpoints, avoiding costly multi-view iterative training. The resulting representation ensures strong 3D semantic coherence and naturally supports hierarchical understanding, enabling both coarse- and fine-grained open-vocabulary perception within a unified semantic field. Extensive experiments demonstrate that our method achieves state-of-the-art open-vocabulary segmentation performance, with semantic field reconstruction completed over $30\times$ faster than prior methods. Our code will be available at https://github.com/Atrovast/THGS.
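To make the idea of lifting 2D features onto superpoints without training more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that some rendering step has already told us, for each pixel, which superpoint it lands on and with what contribution weight; the function names `lift_features_to_superpoints` and `query_superpoints`, their arguments, and the weighted-average fusion rule are all hypothetical choices made for illustration.

```python
import numpy as np


def lift_features_to_superpoints(pixel_features, pixel_superpoint_ids,
                                 pixel_weights, num_superpoints):
    """Fuse per-pixel 2D features into one feature per superpoint.

    Assumed (hypothetical) inputs, gathered over all views:
      pixel_features:       (P, D) float array of 2D semantic features
      pixel_superpoint_ids: (P,) int array, superpoint index hit by each pixel
      pixel_weights:        (P,) non-negative contribution weights
    Returns an (S, D) array of L2-normalized superpoint features;
    superpoints that no pixel touches keep an all-zero feature.
    """
    dim = pixel_features.shape[1]
    feature_sum = np.zeros((num_superpoints, dim), dtype=np.float64)
    weight_sum = np.zeros(num_superpoints, dtype=np.float64)

    # Unbuffered scatter-add, so repeated superpoint indices accumulate.
    np.add.at(feature_sum, pixel_superpoint_ids,
              pixel_features * pixel_weights[:, None])
    np.add.at(weight_sum, pixel_superpoint_ids, pixel_weights)

    # Weighted mean for every superpoint that received at least one pixel.
    seen = weight_sum > 0
    feature_sum[seen] /= weight_sum[seen, None]

    # L2-normalize so that a dot product equals cosine similarity.
    norms = np.linalg.norm(feature_sum, axis=1, keepdims=True)
    return np.divide(feature_sum, norms,
                     out=np.zeros_like(feature_sum), where=norms > 0)


def query_superpoints(superpoint_features, text_embedding):
    """Cosine similarity between each superpoint and a text embedding."""
    text = text_embedding / np.linalg.norm(text_embedding)
    return superpoint_features @ text


if __name__ == "__main__":
    # Toy data: three pixels, two of which fall on superpoint 0.
    features = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    ids = np.array([0, 0, 1])
    weights = np.array([1.0, 3.0, 2.0])
    fused = lift_features_to_superpoints(features, ids, weights, 3)
    print(query_superpoints(fused, np.array([2.0, 0.0])))
```

Because the fusion is a single closed-form aggregation pass rather than a gradient-based fit, each superpoint ends up with exactly one feature shared by all views, which is the kind of view consistency the abstract describes. The actual reprojection strategy and the hierarchical levels of the semantic field may differ substantially from this sketch.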
