Lifting by Gaussians: A Simple, Fast and Flexible Method for 3D Instance Segmentation

Rohan Chacko, Nicolai Haeni, Eldar Khaliullin, Lin Sun, Douglas Lee

TL;DR

Open-world 3D instance segmentation on existing 3D Gaussian Splatting Fields is challenging due to the lack of 3D foundation models and heavy training requirements. Lifting-by-Gaussians (LBG) tackles this by attaching 2D SAM masks and 2D foundation-model features (CLIP, DINOv2) to 3D Gaussians via a per-pixel max-contributor assignment, then incrementally merging fragments across frames with geometric and semantic cues and a hierarchical object–part–subpart decomposition. The approach is training-free and parameterization-agnostic, yielding higher-quality 3D assets with substantially faster processing than prior methods, and it maintains competitive performance in 2D novel-view mask rendering. This enables rapid 3D scene understanding suitable for AR/VR, robotics, and large-scale 3D reconstruction, with potential extensions to lift additional 2D features into 3D and refine small-object segmentation.
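
The per-pixel max-contributor assignment described above can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a rasterizer that exposes, for each pixel, the indices and blending weights of the K Gaussians composited there; the array names (`contrib_ids`, `contrib_weights`, `mask_ids`), the function name, and the majority vote used when one Gaussian is hit by pixels from several masks are all hypothetical choices made for illustration.

```python
import numpy as np

def lift_masks_to_gaussians(contrib_ids, contrib_weights, mask_ids, num_gaussians):
    """Assign a 2D mask label to the max-contributing Gaussian of each pixel.

    contrib_ids:     (H, W, K) int, Gaussian indices blended at each pixel (-1 = empty slot)
    contrib_weights: (H, W, K) float, blending weight of each listed Gaussian
    mask_ids:        (H, W) int, 2D instance id per pixel (-1 = unassigned)
    Returns a (num_gaussians,) int array of per-frame labels, -1 where no label was lifted.
    """
    labels = np.full(num_gaussians, -1, dtype=np.int64)
    num_masks = int(mask_ids.max()) + 1
    if num_masks <= 0:
        return labels  # no mask in this frame

    # Ignore empty slots, then pick the strongest contributor per pixel.
    weights = np.where(contrib_ids >= 0, contrib_weights, -np.inf)
    top_slot = np.argmax(weights, axis=-1)
    top_ids = np.take_along_axis(contrib_ids, top_slot[..., None], axis=-1)[..., 0]

    # Keep only pixels that have both a Gaussian and a mask.
    valid = (top_ids >= 0) & (mask_ids >= 0)
    g, m = top_ids[valid], mask_ids[valid]

    # Assumption: resolve conflicts with a per-Gaussian majority vote over pixels.
    votes = np.zeros((num_gaussians, num_masks), dtype=np.int64)
    np.add.at(votes, (g, m), 1)
    has_vote = votes.sum(axis=1) > 0
    labels[has_vote] = votes[has_vote].argmax(axis=1)
    return labels
```

Gaussians sharing a label within one frame would then form that frame's object fragments; no gradient step is involved, which is consistent with the training-free claim.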

Abstract

We introduce Lifting By Gaussians (LBG), a novel approach for open-world instance segmentation of 3D Gaussian Splatted Radiance Fields (3DGS). Recently, 3DGS Fields have emerged as a highly efficient and explicit alternative to Neural Field-based methods for high-quality Novel View Synthesis. Our 3D instance segmentation method lifts 2D segmentation masks from SAM (or alternatives such as FastSAM), together with features from CLIP and DINOv2, and fuses them directly onto 3DGS (or similar Gaussian radiance fields such as 2DGS). Unlike previous approaches, LBG requires no per-scene training, allowing it to operate seamlessly on any existing 3DGS reconstruction. Our approach is not only an order of magnitude faster and simpler than existing approaches; it is also highly modular, enabling 3D semantic segmentation of existing 3DGS fields without requiring a specific parametrization of the 3D Gaussians. Furthermore, our technique achieves superior semantic segmentation for 2D semantic novel view synthesis and 3D asset extraction while maintaining flexibility and efficiency. We further introduce a novel approach to evaluate individually segmented 3D assets from 3D radiance field segmentation methods.

Paper Structure

This paper contains 22 sections, 1 equation, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Lifting by Gaussians (LBG). LBG utilizes 2D foundation model masks to segment any pretrained 3DGS field into objects, parts, and subparts without gradient-based learning. For each frame, 2D segmentations are lifted onto the per-pixel max-contributor Gaussian, producing object fragments. These fragments are then merged into coherent, scene-level objects based on both geometric and semantic overlap. Through a hierarchical application of this process, LBG extracts high-quality 3D objects, parts, and subparts. In contrast to learning-based methods, LBG achieves this segmentation an order of magnitude faster, enabling new applications like object manipulation in augmented reality.
  • Figure 2: LBG constructs an open-vocabulary 3D instance segmentation from a sequence of posed RGB images. A generic 2D instance segmentation model is used to segment objects, parts, and subparts in each RGB image. Semantic feature vectors are extracted for each region, and the masks are lifted to the per-pixel max-contributing Gaussian, generating per-frame 3D object fragments. These fragments are incrementally merged into coherent, scene-level 3D objects. By applying this process hierarchically to the part and subpart masks, LBG produces a hierarchical decomposition of any 3DGS scene.
  • Figure 3: Qualitative comparison on the LERF dataset for 3D Asset extraction. We show three extracted objects per scene, with two different views for each object. Compared to prior methods, the objects extracted from LBG are much cleaner and have fewer noisy artifacts. 3D objects from SAGA and Gaussian Grouping have missing parts and are of lower quality overall.
  • Figure 4: Qualitative comparison on novel view synthesis for 2D instance masks. Black regions are unassigned. Our 2D masks are on par with those of other methods. LBG picks out instances across segmentation scales better than Gaussian Grouping. Compared to SAGA, our method provides more complete masks.
  • Figure 5: Ablation on using CLIP features for merging. Using only spatial proximity leads to nearby objects being grouped together (red dashed boxes). When DINO features are used together with CLIP, this error is fixed.
  • ...and 7 more figures
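
The incremental merging step described in the captions of Figures 1 and 2 can be sketched roughly as follows. This is a simplified illustration, not the paper's procedure: it assumes each per-frame fragment is a set of Gaussian indices plus one semantic feature vector, it uses the fraction of shared Gaussian indices as a stand-in for geometric overlap, and the function names and both thresholds (`geo_thresh`, `sem_thresh`) are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def merge_fragments(frames, geo_thresh=0.3, sem_thresh=0.8):
    """Greedily merge per-frame fragments into scene-level objects.

    frames: list of frames, each a list of (gaussian_index_set, feature_vector).
    A fragment joins an existing object only if BOTH the geometric overlap and
    the semantic similarity pass their (hypothetical) thresholds.
    """
    objects = []  # each: {"gaussians": set, "feature": ndarray, "count": int}
    for fragments in frames:
        for gaussians, feature in fragments:
            feature = np.asarray(feature, dtype=float)
            best, best_overlap = None, -1.0
            for obj in objects:
                shared = len(gaussians & obj["gaussians"])
                overlap = shared / max(1, min(len(gaussians), len(obj["gaussians"])))
                similar = cosine(feature, obj["feature"]) >= sem_thresh
                if overlap >= geo_thresh and similar and overlap > best_overlap:
                    best, best_overlap = obj, overlap
            if best is None:
                # No compatible object yet: start a new one from this fragment.
                objects.append({"gaussians": set(gaussians), "feature": feature, "count": 1})
            else:
                # Grow the object and keep a running mean of its feature.
                best["gaussians"] |= gaussians
                best["feature"] = (best["feature"] * best["count"] + feature) / (best["count"] + 1)
                best["count"] += 1
    return objects
```

Requiring the semantic check in addition to the geometric one mirrors the Figure 5 ablation, where spatial proximity alone groups nearby objects together. Under the same assumptions, running the routine again on part and subpart masks would give the hierarchical decomposition.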