Gaussian Grouping: Segment and Edit Anything in 3D Scenes

Mingqiao Ye, Martin Danelljan, Fisher Yu, Lei Ke

TL;DR

Gaussian Grouping addresses the lack of object-level semantics in real-time 3D scene representations by augmenting each Gaussian in Gaussian Splatting with an Identity Encoding (length 16) and supervising these encodings with 2D SAM masks and a 3D spatial consistency regularization. The method lifts SAM's 2D segmentation to 3D through differentiable rendering, enabling joint reconstruction, segmentation, and editing of open-world scenes. A 2D Identity Loss and a 3D Regularization Loss based on the k nearest neighbors guide the grouping of Gaussians into instance or stuff identities while maintaining reconstruction quality. The resulting representation supports efficient Local Gaussian Editing (object removal, inpainting, colorization, and style transfer), with competitive or superior segmentation performance and faster editing than NeRF-based approaches. Code is released at https://github.com/lkeab/gaussian-grouping.
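
To make the 2D Identity Loss more concrete, the following is a minimal sketch for a single pixel. It assumes that identity encodings are alpha-composited front to back in the same way as color, and that the rendered feature is passed through a linear classifier followed by a softmax cross-entropy against the mask ID. These details, and all names (`composite_identity`, `identity_loss`, `weight`, `bias`), are illustrative assumptions rather than the authors' implementation; the toy encodings have length 2 instead of 16.

```python
import math

def composite_identity(encodings, alphas):
    """Front-to-back alpha compositing of identity encodings for one pixel.

    encodings: per-Gaussian vectors, depth-sorted with the nearest first.
    alphas:    per-Gaussian opacities in [0, 1] at this pixel (assumed given).
    """
    dim = len(encodings[0])
    pixel = [0.0] * dim
    transmittance = 1.0  # share of the ray not yet blocked by closer Gaussians
    for enc, alpha in zip(encodings, alphas):
        w = alpha * transmittance
        for d in range(dim):
            pixel[d] += w * enc[d]
        transmittance *= 1.0 - alpha
    return pixel

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def identity_loss(pixel_feature, weight, bias, target_id):
    """Cross-entropy between the classified rendered feature and a 2D mask ID.

    weight/bias define a hypothetical linear classifier over identity classes.
    """
    logits = [sum(w * f for w, f in zip(row, pixel_feature)) + b
              for row, b in zip(weight, bias)]
    return -math.log(softmax(logits)[target_id])

# Toy usage: two Gaussians on one ray, two identity classes.
feature = composite_identity([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
loss = identity_loss(feature, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], target_id=0)
print(feature, loss)
```

In a real training loop the loss would be averaged over all pixels and backpropagated through the renderer to the per-Gaussian encodings; this sketch only shows the forward computation.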

Abstract

The recent Gaussian Splatting achieves high-quality and real-time novel-view synthesis of 3D scenes. However, it concentrates solely on appearance and geometry modeling, while lacking fine-grained object-level scene understanding. To address this issue, we propose Gaussian Grouping, which extends Gaussian Splatting to jointly reconstruct and segment anything in open-world 3D scenes. We augment each Gaussian with a compact Identity Encoding, allowing the Gaussians to be grouped according to their object instance or stuff membership in the 3D scene. Instead of resorting to expensive 3D labels, we supervise the Identity Encodings during differentiable rendering by leveraging the 2D mask predictions of the Segment Anything Model (SAM), along with an introduced 3D spatial consistency regularization. Compared to the implicit NeRF representation, we show that the discrete and grouped 3D Gaussians can reconstruct, segment and edit anything in 3D with high visual quality, fine granularity and efficiency. Based on Gaussian Grouping, we further propose a local Gaussian Editing scheme, which shows efficacy in versatile scene editing applications, including 3D object removal, inpainting, colorization, style transfer and scene recomposition. Our code and models are at https://github.com/lkeab/gaussian-grouping.
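
The 3D spatial consistency regularization can be illustrated with a small sketch. The text above only states that it is based on the k nearest neighbors; the sketch assumes one plausible form, in which each Gaussian's identity class distribution is pulled toward those of its k nearest 3D neighbors via a KL divergence. The brute-force neighbor search and the function names are hypothetical, chosen for readability.

```python
import math

def knn_indices(centers, j, k):
    """Indices of the k Gaussians closest to Gaussian j (brute force)."""
    dists = sorted((math.dist(centers[j], c), i)
                   for i, c in enumerate(centers) if i != j)
    return [i for _, i in dists[:k]]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def regularization_3d(centers, class_probs, k):
    """Mean KL divergence between each Gaussian and its k nearest neighbors.

    centers:     3D positions of the Gaussians.
    class_probs: per-Gaussian identity distributions (assumed to be the
                 softmax output of a classifier applied to each encoding).
    """
    total = 0.0
    for j in range(len(centers)):
        for i in knn_indices(centers, j, k):
            total += kl_divergence(class_probs[j], class_probs[i])
    return total / (len(centers) * k)

# Toy usage: two well-separated clusters with consistent identities.
centers = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
probs = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
print(regularization_3d(centers, probs, k=1))
```

With k=1 every Gaussian is compared only with its same-object neighbor, so the penalty vanishes; raising k past the cluster size starts to penalize differences across object boundaries, which is why the neighborhood size matters in such a scheme.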

Paper Structure

This paper contains 44 sections, 4 equations, 17 figures, 6 tables, and 1 algorithm.

Figures (17)

  • Figure 1: Our Gaussian Grouping jointly reconstructs (column a) and segments (column b) anything in full open-world 3D scenes, with fine-grained instance and stuff level modeling. This enables versatile scene editing applications, such as 3D object removal (column c), 3D object inpainting (column d, which first removes the 3D object and then inpaints the holes), and scene re-composition and object colorization (column e). Since the segmentation information is encapsulated in the 3D Gaussians, editing tasks such as 3D object removal, colorization and object location exchange can be performed directly without training, while inpainting only requires minutes of fine-tuning.
  • Figure 2: The pipeline of our Gaussian Grouping contains three main steps: (a) We first prepare the input by deploying SAM to automatically generate masks in everything mode for each view independently. (b) Then, to obtain consistent mask IDs across training views, we use a universal temporal propagation model [cheng2023tracking] to associate the mask labels and generate a coherent multi-view segmentation. (c) With the prepared training input, we jointly learn all properties of the 3D Gaussians, including their group Identity Encoding, by differentiable rendering. Our encoding is supervised by the 2D Identity Loss, which leverages the coherent segmentation views, and a 3D Regularization Loss. We use color to denote object IDs across frames for input views. We omit the rendering process for other Gaussian parameters and the density control part for simplicity, as they are inherited from [kerbl20233d].
  • Figure 3: The grouped 3D Gaussians after training, where each group represents a specific instance / stuff of the 3D scene and can be fully decoupled. Our representation efficiently supports versatile downstream scene editing applications, for which we design a Gaussian Operation List consisting of simple operations such as group deletion, group addition, fine-tuning Spherical Harmonic (SH) coefficients, and exchanging 3D center locations.
  • Figure 4: Ablation on identity consistency across views, where we treat multi-view images as a video and associate the mask labels to generate coherent segmentation labels [cheng2023tracking] for training. We find that using cost-based linear assignment [siddiqui2023panoptic] leads to slower training and inferior testing performance in both reconstruction and segmentation.
  • Figure 5: Robustness to input mask errors on Mip-NeRF 360 [barron2022mipnerf360]. In the 2nd and 3rd columns (middle two views), SAM + DEVA fails to segment and associate the chair across frames. However, owing to the shared 3D Gaussian representation during reconstruction, Gaussian Grouping successfully corrects the error in mask labels and segments the black chair during multi-view rendering.
  • ...and 12 more figures
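
The Gaussian Operation List described in the Figure 3 caption can be sketched as a few list operations over grouped Gaussians. This is a minimal illustration under stated assumptions: each Gaussian is a plain dictionary with hypothetical keys `center`, `color`, and `group`; colorization is reduced to overwriting a single RGB value rather than fine-tuning SH coefficients; and location exchange is modeled as translating each group by the offset between the two group centroids. None of these names or choices come from the released code.

```python
def delete_group(gaussians, group_id):
    """Object removal: drop every Gaussian assigned to group_id."""
    return [g for g in gaussians if g["group"] != group_id]

def recolor_group(gaussians, group_id, rgb):
    """Colorization stand-in: overwrite the color of one group only."""
    return [dict(g, color=rgb) if g["group"] == group_id else g
            for g in gaussians]

def centroid(gaussians, group_id):
    """Mean 3D center of all Gaussians in a group."""
    pts = [g["center"] for g in gaussians if g["group"] == group_id]
    return tuple(sum(p[a] for p in pts) / len(pts) for a in range(3))

def swap_group_locations(gaussians, id_a, id_b):
    """Exchange two objects by translating each group to the other's centroid."""
    ca, cb = centroid(gaussians, id_a), centroid(gaussians, id_b)
    shift = tuple(b - a for a, b in zip(ca, cb))
    out = []
    for g in gaussians:
        if g["group"] == id_a:
            center = tuple(c + s for c, s in zip(g["center"], shift))
        elif g["group"] == id_b:
            center = tuple(c - s for c, s in zip(g["center"], shift))
        else:
            center = g["center"]
        out.append(dict(g, center=center))
    return out

# Toy scene: group 1 has two Gaussians, group 2 has one.
scene = [
    {"center": (0.0, 0.0, 0.0), "color": (1.0, 0.0, 0.0), "group": 1},
    {"center": (2.0, 0.0, 0.0), "color": (1.0, 0.0, 0.0), "group": 1},
    {"center": (10.0, 0.0, 0.0), "color": (0.0, 0.0, 1.0), "group": 2},
]
swapped = swap_group_locations(scene, 1, 2)
print(centroid(swapped, 1), centroid(swapped, 2))
```

Because every operation only filters or rewrites Gaussians by group ID, none of them needs retraining, which is consistent with the Figure 1 caption's remark that removal, colorization, and location exchange can be performed directly.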