
TreeGaussian: Tree-Guided Cascaded Contrastive Learning for Hierarchical Consistent 3D Gaussian Scene Segmentation and Understanding

Jingbin You, Zehao Li, Hao Jiang, Xinzhu Ma, Shuqin Gao, Honglong Zhao, Congcong Zheng, Tianlu Mao, Feng Dai, Yucheng Zhang, Zhaoqi Wang

Abstract

3D Gaussian Splatting (3DGS) has emerged as a real-time, differentiable representation for neural scene understanding. However, existing 3DGS-based methods struggle to represent hierarchical 3D semantic structures and capture whole-part relationships in complex scenes. Moreover, dense pairwise comparisons and inconsistent hierarchical labels from 2D priors hinder feature learning, resulting in suboptimal segmentation. To address these limitations, we introduce TreeGaussian, a tree-guided cascaded contrastive learning framework that explicitly models hierarchical semantic relationships and reduces redundancy in contrastive supervision. By constructing a multi-level object tree, TreeGaussian enables structured learning across object-part hierarchies. In addition, we propose a two-stage cascaded contrastive learning strategy that progressively refines feature representations from global to local, mitigating saturation and stabilizing training. A Consistent Segmentation Detection (CSD) mechanism and a graph-based denoising module are further introduced to align segmentation modes across views while suppressing unstable Gaussian points, enhancing segmentation consistency and quality. Extensive experiments, including open-vocabulary 3D object selection, 3D point cloud understanding, and ablation studies, demonstrate the effectiveness and robustness of our approach.

Paper Structure

This paper contains 27 sections, 9 equations, 13 figures, and 12 tables.

Figures (13)

  • Figure 1: Motivation illustration. (a) Flat contrastive learning isolates feature spaces for object wholes and parts, limiting their hierarchical interaction. (b) Fused contrastive learning merges feature spaces but suffers from oversaturation and instability due to dense pairwise comparisons. (c) Cascaded contrastive learning (our method) preserves semantic hierarchy while minimizing contrastive redundancy, enabling efficient and stable optimization. Solid arrows indicate implicit contrastive interactions among clusters; dashed arrows denote explicit hierarchical indexing relationships.
  • Figure 2: Overview of our method. (a) Constructing an object tree from multi-view images using SAM to capture structured relationships between object parts and wholes. (b) Two-stage cascaded contrastive learning strategy that progressively optimizes the instance feature of each Gaussian point. (c) Graph-based denoising applied to each language-mapped instance cluster to improve multi-view rendering quality.
  • Figure 3: Consistent Segmentation Detection (CSD) for local contrastive learning. The blue curve shows the raw split number and the red curve shows the smoothed reference across views. Views where the blue curve lies above the red reference are treated as over-segmentation (apply only $\mathcal{L}_{pull}^2$), while views where it lies below are treated as under-segmentation (apply only $\mathcal{L}_{push}^2$). When the blue curve is close to the reference, we treat it as optimal segmentation (apply $\mathcal{L}_{pull}^2+\mathcal{L}_{push}^2$).
  • Figure 4: Qualitative comparison of the rendered instance feature maps. Our method achieves better global feature consistency across objects (cup and spoon) at the whole scale and exhibits clearer feature separation at the part scale.
  • Figure 5: Qualitative comparison of rendered local objects. Our method produces cleaner and more accurate segmentation results compared to baselines, effectively reducing noise and preserving fine-grained structures.
  • ...and 8 more figures
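The per-view decision rule described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the paper's implementation: the smoothing window, the tolerance band, and the function and mode names (`csd_loss_mode`, `"pull"`, `"push"`) are all assumptions made for the example; the paper only specifies that each view's raw split number is compared against a smoothed reference to select which contrastive terms are active.

```python
import numpy as np

def csd_loss_mode(split_numbers, window=3, tol=0.1):
    """Illustrative CSD sketch: for each view, compare the raw split
    number (blue curve) against a smoothed reference (red curve) and
    decide which local contrastive terms to apply. The moving-average
    smoother and the +/-10% tolerance band are assumptions."""
    x = np.asarray(split_numbers, dtype=float)
    # Smoothed reference curve: simple moving average over neighboring views.
    kernel = np.ones(window) / window
    ref = np.convolve(x, kernel, mode="same")
    modes = []
    for raw, r in zip(x, ref):
        if raw > r * (1 + tol):        # over-segmentation: merge split parts
            modes.append("pull")        # apply only the pull term
        elif raw < r * (1 - tol):       # under-segmentation: separate merged parts
            modes.append("push")        # apply only the push term
        else:                           # near-optimal segmentation
            modes.append("pull+push")   # apply both terms
    return modes
```

In a training loop, the returned mode for each view would gate which of the two local contrastive losses contributes to the total objective for that view.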