Segment Anything in 3D with Radiance Fields

Jiazhong Cen, Jiemin Fang, Zanwei Zhou, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, Qi Tian

TL;DR

This paper introduces SA3D, a method that extends the 2D segmentation capabilities of SAM to 3D by using radiance fields (e.g., NeRF and 3D-GS) as a 3D prior. The approach is an efficient, iterative pipeline: generate a 2D mask in a single view with SAM from a user prompt, project it into 3D via mask inverse rendering, and automatically generate prompts from rendered masks so that SAM can segment the target in new views (cross-view self-prompting). It employs two 3D mask representations to match NeRF and 3D-GS, a mask projection loss with a negative refinement term, and IoU-based view rejection to mitigate occlusions, along with an optional ambiguous-Gaussians removal step for 3D-GS. Across the NVOS, SPIn-NeRF, and Replica datasets, SA3D achieves high 3D segmentation quality within seconds, demonstrating a practical path to leveraging 2D foundation models for 3D perception without additional 3D training.
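The IoU-based view rejection mentioned above can be illustrated with a minimal sketch. It assumes the check compares the mask rendered from the current 3D mask against the mask SAM returns for the same view, and skips the view when the two disagree too much. The function names and the threshold value are hypothetical and not taken from the paper or its code.

```python
import numpy as np

def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks of the same shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Both masks are empty: treat as no agreement.
        return 0.0
    return float(np.logical_and(a, b).sum() / union)

def accept_view(rendered_mask, sam_mask, iou_threshold=0.5):
    """Keep a view only if SAM's mask agrees with the rendered 3D mask.

    iou_threshold is an illustrative value (assumption), not the paper's setting.
    """
    return mask_iou(rendered_mask, sam_mask) >= iou_threshold
```

A rejected view contributes nothing to the 3D mask in that iteration, which keeps a SAM mask that latched onto an occluder from being projected into 3D.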

Abstract

The Segment Anything Model (SAM) has emerged as a powerful vision foundation model for generating high-quality 2D segmentation results. This paper aims to generalize SAM to segment 3D objects. Rather than replicating the data acquisition and annotation procedure, which is costly in 3D, we design an efficient solution, leveraging the radiance field as a cheap and off-the-shelf prior that connects multi-view 2D images to the 3D space. We refer to the proposed solution as SA3D, short for Segment Anything in 3D. With SA3D, the user is only required to provide a 2D segmentation prompt (e.g., rough points) for the target object in a single view, which is used to generate its corresponding 2D mask with SAM. Next, SA3D alternately performs mask inverse rendering and cross-view self-prompting across various views to iteratively refine the 3D mask of the target object. For one view, mask inverse rendering projects the 2D mask obtained by SAM into the 3D space, guided by the density distribution learned by the radiance field, to refine the 3D mask. Then, for a new view, cross-view self-prompting automatically extracts reliable prompts from the 2D mask rendered from the still-inaccurate 3D mask and feeds them to SAM. We show in experiments that SA3D adapts to various scenes and achieves 3D segmentation within seconds. Our research reveals a potential methodology to lift the ability of a 2D segmentation model to 3D. Our code is available at https://github.com/Jumpat/SegmentAnythingin3D.
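To make mask inverse rendering more concrete, here is a toy sketch under stated assumptions: the 3D mask is a vector of soft per-voxel scores, each ray sample carries a volume-rendering weight derived from the radiance field's density, and the loss rewards rendered mask values inside the SAM mask while penalizing them outside with a small factor `lam` (standing in for the negative refinement term). The loss form, the names, and the values of `lam` and `lr` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def render_mask(weights, voxel_scores, ray_voxel_ids):
    """Render one mask value per ray.

    weights:       (R, S) volume-rendering weights of the S samples on each ray
    ray_voxel_ids: (R, S) index of the voxel each sample falls into
    voxel_scores:  (V,)   soft 3D mask score per voxel
    """
    return (weights * voxel_scores[ray_voxel_ids]).sum(axis=1)

def projection_loss(rendered, sam_mask, lam=0.15):
    """Reward mask inside the SAM region, penalize it outside (assumed form)."""
    sam = np.asarray(sam_mask, dtype=float)
    return float(-(sam * rendered).sum() + lam * ((1.0 - sam) * rendered).sum())

def inverse_render_step(weights, voxel_scores, ray_voxel_ids, sam_mask,
                        lam=0.15, lr=1.0):
    """One gradient-descent step on the voxel scores for a single view."""
    sam = np.asarray(sam_mask, dtype=float)
    # Derivative of the loss with respect to each ray's rendered mask value.
    per_ray = -sam + lam * (1.0 - sam)
    grad = np.zeros_like(voxel_scores, dtype=float)
    # Each sample passes its ray's gradient to its voxel, scaled by its weight,
    # so high-density regions along the ray receive most of the update.
    np.add.at(grad, ray_voxel_ids, weights * per_ray[:, None])
    return voxel_scores - lr * grad
```

Because the rendered mask is linear in the voxel scores, the gradient is exact, and the density-derived weights decide how far along each ray the 2D mask is deposited.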

Paper Structure

This paper contains 37 sections, 11 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Given a pre-trained radiance field, SA3D takes prompts from one single rendered view as input and outputs the 3D segmentation result for the specific target.
  • Figure 2: The overall pipeline of SA3D. Given a set of multi-view 2D images and a radiance field trained on them, SA3D first employs the SAM encoder to extract features from all images and builds a cache. SA3D then takes prompts for the target object in a single view and uses SAM to produce a 2D mask in that view. Subsequently, SA3D alternates between mask inverse rendering and cross-view self-prompting to refine the 3D mask of the target object. Mask inverse rendering projects the 2D mask obtained by the SAM decoder into the 3D space, according to the density distribution learned by the radiance field, to refine the 3D mask. Cross-view self-prompting automatically extracts reliable prompts from the 2D mask rendered in a novel view and passes them to the SAM decoder. This alternating process is repeated until an accurate 3D mask is obtained.
  • Figure 3: Illustration of the proposed self-prompting strategy. It converts the 2D-rendered mask of a new view into corresponding point prompts, leveraging the 3D geometry prior and 2D-rendered mask confidence.
  • Figure 4: Illustration of ambiguous Gaussians at the interface between two objects. The proposed ambiguous-Gaussians removal method effectively alleviates this phenomenon.
  • Figure 5: Visualization results in different scenes: LERF-donuts (LERF), LERF-figurines, 360-kitchen (Mip-NeRF 360), LERF-bouquet, T&T-truck (Tanks and Temples), and LERF-nerfgun.
  • ...and 7 more figures
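
The self-prompting strategy in the Figure 3 caption turns a rendered mask into point prompts using mask confidence and 3D geometry. A minimal sketch of one way this could work follows, assuming a per-pixel confidence map and a per-pixel 3D point map (for example, unprojected from rendered depth) are available. Pixels are picked greedily by confidence, and confidence near each pick is suppressed in proportion to 3D distance so the prompts spread over the object. The function name, the linear decay rule, and the `radius` parameter are hypothetical and not the paper's exact procedure.

```python
import numpy as np

def select_point_prompts(confidence, points_3d, num_prompts=3, radius=0.5,
                         min_conf=0.0):
    """Greedily pick pixel prompts from a rendered mask confidence map.

    confidence: (H, W) rendered mask confidence
    points_3d:  (H, W, 3) 3D location seen at each pixel (assumed input)
    Returns a list of (row, col) pixel coordinates.
    """
    conf = np.array(confidence, dtype=float)  # copy; the input stays untouched
    width = conf.shape[1]
    prompts = []
    for _ in range(num_prompts):
        row, col = divmod(int(np.argmax(conf)), width)
        if conf[row, col] <= min_conf:
            break  # no confident pixel left
        prompts.append((row, col))
        # Suppress pixels whose 3D points lie close to the chosen one;
        # the chosen pixel itself drops to zero confidence.
        dist = np.linalg.norm(points_3d - points_3d[row, col], axis=-1)
        conf *= np.clip(dist / radius, 0.0, 1.0)
    return prompts
```

Measuring distance in 3D rather than in pixels means two pixels that are adjacent in the image but far apart in depth do not suppress each other.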