InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception

Haijie Li, Yanmin Wu, Jiarui Meng, Qiankun Gao, Zhiyao Zhang, Ronggang Wang, Jian Zhang

TL;DR

InstanceGaussian tackles 3D scene understanding by jointly learning appearance and instance-level semantics in a single Gaussian representation. It introduces Semantic-Scaffold-GS to balance appearance and semantics and keep them consistent at object boundaries, a progressive training schedule to stabilize learning, and a bottom-up, category-agnostic instantiation step that uses farthest point sampling (FPS) and graph connectivity to assemble complete objects. The approach reports state-of-the-art results on category-agnostic, open-vocabulary 3D point-level segmentation and supports text-based object querying and rendering. The authors position the work for applications in robotics, autonomous driving, and augmented reality.

Abstract

3D scene understanding has become an essential area of research with applications in autonomous driving, robotics, and augmented reality. Recently, 3D Gaussian Splatting (3DGS) has emerged as a powerful approach, combining explicit modeling with neural adaptability to provide efficient and detailed scene representations. However, three major challenges remain in leveraging 3DGS for scene understanding: 1) an imbalance between appearance and semantics, where dense Gaussian usage for fine-grained texture modeling does not align with the minimal requirements for semantic attributes; 2) inconsistencies between appearance and semantics, as purely appearance-based Gaussians often misrepresent object boundaries; and 3) reliance on top-down instance segmentation methods, which struggle with uneven category distributions, leading to over- or under-segmentation. In this work, we propose InstanceGaussian, a method that jointly learns appearance and semantic features while adaptively aggregating instances. Our contributions include: i) a novel Semantic-Scaffold-GS representation balancing appearance and semantics to improve feature representations and boundary delineation; ii) a progressive appearance-semantic joint training strategy to enhance stability and segmentation accuracy; and iii) a bottom-up, category-agnostic instance aggregation approach that addresses segmentation challenges through farthest point sampling and connected component analysis. Our approach achieves state-of-the-art performance in category-agnostic, open-vocabulary 3D point-level segmentation, highlighting the effectiveness of the proposed representation and training strategies. Project page: https://lhj-git.github.io/InstanceGaussian/
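The bottom-up aggregation named in contribution iii) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the nearest-seed clustering, and the two merge thresholds (`dist_thresh`, `sim_thresh`) are assumptions made for illustration, since the abstract only states that farthest point sampling and connected component analysis are used. The sketch over-segments a point set around FPS seeds, links clusters that are both spatially close and similar in feature space, and takes each connected component of that graph as one instance.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from all chosen samples."""
    chosen = np.empty(k, dtype=np.int64)
    chosen[0] = start
    # Distance from every point to its nearest chosen sample so far.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, k):
        chosen[i] = int(np.argmax(min_dist))
        d = np.linalg.norm(points - points[chosen[i]], axis=1)
        min_dist = np.minimum(min_dist, d)
    return chosen

def connected_components(adjacency):
    """Label the connected components of an undirected graph (boolean matrix)."""
    n = adjacency.shape[0]
    labels = np.full(n, -1, dtype=np.int64)
    current = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:  # depth-first traversal from the seed
            node = stack.pop()
            for nb in np.flatnonzero(adjacency[node]):
                if labels[nb] == -1:
                    labels[nb] = current
                    stack.append(int(nb))
        current += 1
    return labels

def aggregate_instances(points, features, k, dist_thresh, sim_thresh):
    """Hypothetical pipeline: FPS seeds -> over-segmentation -> graph merge.

    Assumes k does not exceed the number of distinct points, so that every
    seed owns at least one point.
    """
    seeds = farthest_point_sampling(points, k)
    # Over-segmentation: assign each point to its nearest seed.
    d = np.linalg.norm(points[:, None, :] - points[seeds][None, :, :], axis=2)
    cluster = np.argmin(d, axis=1)
    centers = np.stack([points[cluster == c].mean(axis=0) for c in range(k)])
    feats = np.stack([features[cluster == c].mean(axis=0) for c in range(k)])
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    # Assumed merge rule: clusters are linked if close in space AND similar
    # in feature space (cosine similarity).
    close = np.linalg.norm(centers[:, None] - centers[None], axis=2) < dist_thresh
    similar = feats @ feats.T > sim_thresh
    component = connected_components(close & similar)
    return component[cluster]  # per-point instance id
```

Because instances come from graph connectivity rather than a fixed number of categories, the number of objects does not have to be chosen in advance; only the over-segmentation granularity `k` and the merge thresholds do.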

Paper Structure

This paper contains 28 sections, 6 equations, 13 figures, 13 tables, 1 algorithm.

Figures (13)

  • Figure 1: Top row: Appearance-semantic joint Gaussian representation avoids the imbalance and inconsistency in appearance-semantic learning. Bottom row: Bottom-up instantiation: Over-segmentation is achieved via FPS sampling and clustering, followed by instantiation through graph-connectivity-based aggregation.
  • Figure 2: Progressive appearance-semantic joint training. (a) Train appearance only; (b) Independent appearance-semantic training; (c) Joint appearance-semantic training.
  • Figure 3: Visual comparison of category-agnostic 3D instance segmentation results. InstanceGaussian outperforms OpenGaussian and GaussianGrouping in accurately distinguishing different 3D objects.
  • Figure 4: Open-vocabulary point cloud understanding with text queries on the ScanNet dataset. InstanceGaussian shows advanced text query capabilities.
  • Figure 5: Open-vocabulary 3D object selection and rendering on the LeRF dataset. InstanceGaussian outperforms OpenGaussian in accurately identifying object boundaries from text queries.
  • ...and 8 more figures