Gaga: Group Any Gaussians via 3D-aware Memory Bank

Weijie Lyu, Xueting Li, Abhijit Kundu, Yi-Hsuan Tsai, Ming-Hsuan Yang

TL;DR

Gaga introduces a 3D-aware memory bank to achieve cross-view, open-world 3D segmentation by grouping Gaussians from Gaussian Splatting according to overlap in 3D space, guided by depth information. It lifts inconsistent 2D masks into a unified 3D segmentation by associating masks across views and learning a per-Gaussian identity encoding for multi-view segmentation rendering. The approach is compatible with arbitrary 2D segmentation models and demonstrates superior performance and robustness across datasets (LERF-Mask, Replica, ScanNet, MipNeRF 360) and low-data regimes, with strong benefits for scene manipulation. This yields practical impact for 3D scene understanding and editability, enabling precise object-level edits and consistent 3D labeling in open-world environments.
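
The overlap-based grouping described above can be illustrated with a minimal sketch. It assumes each 2D mask has already been reduced to the set of indices of the Gaussians that project into it. The function names, the choice of overlap ratio (shared Gaussians divided by the mask's Gaussians), and the `threshold` value are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import List, Set


def overlap_ratio(mask_gaussians: Set[int], group_gaussians: Set[int]) -> float:
    """Fraction of the mask's Gaussians that already belong to a group."""
    if not mask_gaussians:
        return 0.0
    return len(mask_gaussians & group_gaussians) / len(mask_gaussians)


def associate_mask(
    mask_gaussians: Set[int],
    memory_bank: List[Set[int]],
    threshold: float = 0.1,  # hypothetical value, chosen for illustration
) -> int:
    """Return a group ID for the mask, updating the memory bank in place.

    The mask joins the group it overlaps most; if no group overlaps
    enough, the mask starts a new group.
    """
    best_id, best_overlap = -1, 0.0
    for group_id, group in enumerate(memory_bank):
        ratio = overlap_ratio(mask_gaussians, group)
        if ratio > best_overlap:
            best_id, best_overlap = group_id, ratio

    if best_id == -1 or best_overlap < threshold:
        memory_bank.append(set(mask_gaussians))
        return len(memory_bank) - 1

    # Grow the matched group so later views can match against it.
    memory_bank[best_id] |= mask_gaussians
    return best_id


# Two views of the same toy scene; each mask is a set of Gaussian indices.
bank: List[Set[int]] = []
view1 = [{1, 2, 3, 4}, {10, 11, 12}]
view2 = [{11, 12, 13}, {3, 4, 5}, {20, 21}]

ids1 = [associate_mask(m, bank) for m in view1]
ids2 = [associate_mask(m, bank) for m in view2]
print(ids1, ids2)
```

In this toy example the masks in the second view arrive in a different order than in the first, yet each receives the ID of the group it shares Gaussians with, and the unseen mask opens a new group. No assumption about view ordering or continuity is needed, which is the property the summary attributes to the memory bank.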

Abstract

We introduce Gaga, a framework that reconstructs and segments open-world 3D scenes by leveraging inconsistent 2D masks predicted by zero-shot class-agnostic segmentation models. In contrast to prior 3D scene segmentation approaches that rely on video object tracking or contrastive learning methods, Gaga utilizes spatial information and effectively associates object masks across diverse camera poses through a novel 3D-aware memory bank. By eliminating the assumption of continuous view changes in training images, Gaga demonstrates robustness to variations in camera poses, which is particularly beneficial for sparsely sampled images, and ensures precise mask label consistency. Furthermore, Gaga accommodates 2D segmentation masks from diverse sources and demonstrates robust performance with different open-world zero-shot class-agnostic segmentation models, significantly enhancing its versatility. Extensive qualitative and quantitative evaluations demonstrate that Gaga performs favorably against state-of-the-art methods, emphasizing its potential for real-world applications such as 3D scene understanding and manipulation.

Paper Structure (27 sections, 4 equations, 13 figures, 6 tables, 1 algorithm)

Figures (13)

  • Figure 1: Gaga groups any Gaussians in an open-world 3D scene and renders multi-view consistent segmentation (pixels of the same region across views are shown in the same color). By employing a 3D-aware memory bank, we eliminate the label inconsistency in 2D segmentation predicted by foundation models and assign each mask across different views a universal group ID. This makes it possible to lift 2D segmentation to a consistent 3D segmentation. Gaga produces accurate 3D object segmentation, achieving high-quality results for downstream applications such as scene manipulation (e.g., changing the color of the footstool's cushion to maroon).
  • Figure 2: Comparison of rendered segmentation. Contrastive learning-based methods, such as OmniSeg [omniseg3d], do not provide unique mask labels to each segmentation group, leading to inconsistencies across multiple views (e.g., the coffee table). Gaussian Grouping [gg] addresses multi-view segmentation by utilizing a video tracker, but it often misidentifies objects when similar items are present (e.g., the leather sofa) and struggles with significant camera perspective changes. In contrast, Gaga ensures multi-view consistent segmentation masks, overcoming these limitations.
  • Figure 3: Overview of Gaga. Gaga reconstructs 3D scenes using Gaussian Splatting and adopts any open-world model to generate 2D segmentation masks. To eliminate the 2D mask label inconsistency, we design a mask association process, where a 3D-aware memory bank is employed to assign a consistent group ID across different views to each 2D mask based on the 3D Gaussians projected to that mask (Sec. \ref{subsec:gaga}). Specifically, we find the corresponding Gaussians projected to the 2D mask and assign the mask the group ID in the memory bank with which it shares the most Gaussians (Eq. \ref{equ:overlap}). After the 3D-aware mask association process, we use masks with multi-view consistent group IDs as pseudo labels to train an identity encoding on each 3D Gaussian for segmentation rendering.
  • Figure 4: Corresponding Gaussians with Depth Guidance. In View 1, we select Gaussians (colored in red) that are splatted inside the mask of the bulldozer. As shown in column 3, some of these Gaussians not only belong to the bulldozer but also represent background objects, as seen in View 2. To refine the selection, we render the depth map and retain only the Gaussians within the specified depth region, ensuring they correspond to the bulldozer's mask. As shown in column 4, the final selection with depth guidance consists primarily of Gaussians belonging to the bulldozer.
  • Figure 5: Qualitative results on LERF-Mask. Our rendered segmentation exhibits fewer artifacts and delivers more accurate segmentation results than both prior 3D class-agnostic segmentation works and language embedding works.
  • ...and 8 more figures
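
The depth-guided selection described in the Figure 4 caption can be illustrated with a minimal sketch. It assumes each Gaussian has already been projected to a pixel with a camera-space depth, and that a rendered depth map is available as a pixel-to-depth lookup. The `ProjectedGaussian` type, the `select_gaussians` helper, and the `tolerance` value are hypothetical; the caption only states that Gaussians within a specified depth region are retained, so the exact criterion here is an assumption.

```python
from typing import Dict, List, NamedTuple, Set, Tuple

Pixel = Tuple[int, int]  # (row, col)


class ProjectedGaussian(NamedTuple):
    index: int    # index of the Gaussian in the scene
    pixel: Pixel  # pixel its centre projects to in this view
    depth: float  # camera-space depth of its centre


def select_gaussians(
    gaussians: List[ProjectedGaussian],
    mask: Set[Pixel],
    depth_map: Dict[Pixel, float],
    tolerance: float = 0.2,  # hypothetical depth band around the surface
) -> Set[int]:
    """Keep Gaussians inside the mask that lie near the rendered surface."""
    selected: Set[int] = set()
    for g in gaussians:
        if g.pixel not in mask:
            continue  # projects outside the object's 2D mask
        surface_depth = depth_map[g.pixel]
        # Assumed criterion: discard Gaussians far from the visible surface,
        # such as background Gaussians that land inside the mask.
        if abs(g.depth - surface_depth) <= tolerance:
            selected.add(g.index)
    return selected


gaussians = [
    ProjectedGaussian(0, (0, 0), 1.0),  # on the object
    ProjectedGaussian(1, (0, 0), 4.5),  # background behind the object
    ProjectedGaussian(2, (0, 1), 1.1),  # on the object
    ProjectedGaussian(3, (5, 5), 1.0),  # outside the mask
]
mask = {(0, 0), (0, 1)}
depth_map = {(0, 0): 1.05, (0, 1): 1.0}

kept = select_gaussians(gaussians, mask, depth_map)
print(sorted(kept))
```

Without the depth check, Gaussian 1 would be attributed to the object because it projects inside the mask, which mirrors the background contamination the caption describes for the bulldozer example.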