Rethinking End-to-End 2D to 3D Scene Segmentation in Gaussian Splatting

Runsong Zhu, Shi Qiu, Zhengzhe Liu, Ka-Hei Hui, Qianyi Wu, Pheng-Ann Heng, Chi-Wing Fu

TL;DR

Unified-Lift addresses the challenge of end-to-end lifting of 2D instance segmentation to 3D scenes by leveraging 3D Gaussian Splatting (3D-GS) and introducing a global object-level codebook. It augments each Gaussian point with a Gaussian-level feature learned via contrastive supervision and learns an explicit object-level representation to guide segmentation, supported by an association learning module and a noisy-label filtering mechanism. The approach achieves state-of-the-art results on multiple benchmarks (LERF-Mask, Replica, Messy Rooms) in both segmentation quality and inference efficiency, without any pre- or post-processing. This object-centric, end-to-end framework enables scalable, multi-view-consistent 3D segmentation and facilitates downstream tasks such as multi-granularity object editing in 3D scenes.

Abstract

Lifting multi-view 2D instance segmentation to a radiance field has proven effective for enhancing 3D understanding. Existing methods either rely on direct matching for end-to-end lifting, which yields inferior results, or employ a two-stage solution constrained by complex pre- or post-processing. In this work, we design a new end-to-end object-aware lifting approach, named Unified-Lift, that provides accurate 3D segmentation based on the 3D Gaussian representation. To start, we augment each Gaussian point with an additional Gaussian-level feature learned using a contrastive loss to encode instance information. Importantly, we introduce a learnable object-level codebook that accounts for individual objects in the scene, providing an explicit object-level understanding, and we associate the encoded object-level features with the Gaussian-level point features for segmentation predictions. While promising, effective codebook learning is non-trivial, and a naive solution leads to degraded performance. Therefore, we formulate an association learning module and a noisy label filtering module for effective and robust codebook learning. We conduct experiments on three benchmarks: the LERF-Masked, Replica, and Messy Rooms datasets. Both qualitative and quantitative results show that Unified-Lift clearly outperforms existing methods in terms of segmentation quality and time efficiency. The code is publicly available at https://github.com/Runsong123/Unified-Lift.
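
To make the association step concrete, the following is a minimal sketch of how Gaussian-level features might be matched against an object-level codebook. It is not the authors' implementation: the abstract does not state the similarity measure or normalization, so the dot-product similarity, the softmax, the function names (`associate`, `segment`), and the toy 2-D features are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def associate(point_feature, codebook):
    """Return per-object probabilities for one Gaussian-level feature.

    Assumption: similarity is a plain dot product followed by a softmax;
    the source does not specify either choice.
    """
    scores = [
        sum(p * c for p, c in zip(point_feature, entry))
        for entry in codebook
    ]
    return softmax(scores)


def segment(point_features, codebook):
    """Assign each feature the index of its most probable codebook entry."""
    labels = []
    for feature in point_features:
        probs = associate(feature, codebook)
        labels.append(max(range(len(probs)), key=probs.__getitem__))
    return labels


# Hypothetical toy data: a codebook with 3 object entries and 2-D features.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
point_features = [[0.9, 0.1], [0.2, 0.8], [-0.7, 0.1], [0.1, 0.95]]

labels = segment(point_features, codebook)
print(labels)
```

In this sketch the codebook is fixed; in the approach described above it is learnable, and its training relies on the association learning and noisy label filtering modules, which are not modeled here.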

Paper Structure

This paper contains 31 sections, 8 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Comparing the pipeline of our method against previous lifting solutions.
  • Figure 2: Overview of Unified-Lift, which is built on the 3D Gaussian Splatting (3D-GS) representation (top-left). In our pipeline, we first augment each Gaussian point in 3D-GS with a Gaussian-level feature and use a contrastive loss to optimize the rendered features (see top; detailed in Sec. \ref{sec:preliminary}). To impose an object-level understanding on the 3D scene, we introduce an additional object-level codebook and establish associations between the object-level features and the Gaussian-level features (see bottom-left; detailed in Sec. \ref{sec:codebook}). Further, we propose two novel modules, the association learning module and the noisy label filtering module, to learn the codebook robustly and accurately (see bottom-right; detailed in Sec. \ref{sec:training}).
  • Figure 3: Visual comparisons. Segmentation results produced by our method and by the Gaussian-level feature-based method \cite{ying2024omniseg3d} with post-processing \cite{mcinnes2017hdbscan}. The latter tends to overlook small objects and produce artifacts. In contrast, our method generates more accurate segmentations.
  • Figure 4: Comparison of the pseudo labels generated by Panoptic Lifting \cite{siddiqui2023panoptic} and by our method. With the designed area-aware ID mapping, we obtain more view-consistent segmentations as pseudo labels to facilitate codebook learning.
  • Figure 5: Visual comparison of the generated uncertainty maps and 2D instance segmentation masks across different views of the "Office3" scene in the Replica dataset \cite{straub2019replica}.
  • ...and 1 more figure