SANeRF-HQ: Segment Anything for NeRF in High Quality

Yichen Liu, Benran Hu, Chi-Keung Tang, Yu-Wing Tai

TL;DR

The paper tackles open-world 3D object segmentation within Neural Radiance Fields (NeRF) by using the Segment Anything Model (SAM) for promptable 2D masks and NeRF for cross-view information fusion. It introduces SANeRF-HQ, a three-component pipeline consisting of a feature container (cache or distillation), a mask decoder, and a mask aggregator that builds a 3D object field while enforcing high-quality boundaries and multi-view consistency. A key innovation is the Ray-Pair RGB loss, which aligns color-based ray similarities with segmentation predictions and uses error-guided local sampling to refine boundaries. Across multiple NeRF datasets, SANeRF-HQ demonstrates higher segmentation quality and robustness than prior zero-shot and auto-segmentation approaches, with practical implications for interactive 3D scene understanding and potential extensions to dynamic 4D scenes.
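
To make the idea behind the Ray-Pair RGB loss more concrete, here is a minimal NumPy sketch. It is not the paper's formulation: the Gaussian color-similarity weight, the `sigma` parameter, the function name, and the mean over all ray pairs are assumptions chosen for illustration, and the error-guided local sampling step is omitted (the rays are assumed to come from one already-sampled local patch). The only property taken from the summary above is that rays with similar colors are pushed toward similar segmentation predictions.

```python
import numpy as np


def ray_pair_rgb_loss(rgb, mask_prob, sigma=0.1):
    """Hypothetical ray-pair loss sketch (not the paper's exact definition).

    rgb:       (N, 3) rendered colors in [0, 1] for N rays of a local patch.
    mask_prob: (N,) predicted object probabilities for the same rays.
    sigma:     assumed bandwidth of the color-similarity kernel.
    """
    rgb = np.asarray(rgb, dtype=float)
    mask_prob = np.asarray(mask_prob, dtype=float)

    # Pairwise squared color distance between all rays in the patch.
    diff = rgb[:, None, :] - rgb[None, :, :]
    color_dist2 = np.sum(diff ** 2, axis=-1)

    # Assumed similarity weight: close to 1 for similar colors, close to 0 otherwise.
    weight = np.exp(-color_dist2 / (2.0 * sigma ** 2))

    # Squared disagreement between the mask predictions of each ray pair.
    pred_diff2 = (mask_prob[:, None] - mask_prob[None, :]) ** 2

    # Average over distinct pairs only (drop the diagonal of self-pairs).
    off_diag = ~np.eye(len(rgb), dtype=bool)
    return float(np.mean(weight[off_diag] * pred_diff2[off_diag]))
```

Under these assumptions, two rays with the same color but opposite predictions incur a large penalty, while two rays with very different colors incur almost none, so label changes are tolerated only where the color changes.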

Abstract

Recently, the Segment Anything Model (SAM) has showcased remarkable capabilities of zero-shot segmentation, while NeRF (Neural Radiance Fields) has gained popularity as a method for various 3D problems beyond novel view synthesis. Though there have been initial attempts to incorporate these two methods into 3D segmentation, they face the challenge of accurately and consistently segmenting objects in complex scenarios. In this paper, we introduce Segment Anything for NeRF in High Quality (SANeRF-HQ) to achieve high-quality 3D segmentation of any target object in a given scene. SANeRF-HQ utilizes SAM for open-world object segmentation guided by user-supplied prompts, while leveraging NeRF to aggregate information from different viewpoints. To overcome the aforementioned challenges, we employ the density field and RGB similarity to enhance the accuracy of the segmentation boundary during aggregation. Emphasizing segmentation accuracy, we evaluate our method on multiple NeRF datasets where high-quality ground truths are available or manually annotated. SANeRF-HQ shows a significant quality improvement over state-of-the-art methods in NeRF object segmentation, provides higher flexibility for object localization, and enables more consistent object segmentation across multiple views. Results and code are available at the project site: https://lyclyc52.github.io/SANeRF-HQ/.
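
The abstract states that the density field is used during aggregation but does not spell out how. One plausible reading, sketched below as an assumption rather than as the authors' implementation, is that a per-sample object probability is composited along each ray with the standard NeRF volume-rendering weights, so that samples in empty or occluded space contribute little to the rendered 2D mask. The function names and the toy inputs are hypothetical.

```python
import numpy as np


def render_weights(sigma, delta):
    """Standard NeRF compositing weights for the samples along one ray.

    sigma: (S,) densities at the S samples.
    delta: (S,) distances between consecutive samples.
    """
    sigma = np.asarray(sigma, dtype=float)
    delta = np.asarray(delta, dtype=float)
    # Opacity of each sample.
    alpha = 1.0 - np.exp(-sigma * delta)
    # Transmittance: probability that the ray reaches each sample unoccluded.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    return trans * alpha


def render_object_mask(sigma, delta, obj_prob):
    """Composite per-sample object probabilities into one mask value for the ray.

    obj_prob: (S,) assumed output of an object field queried at the samples.
    """
    weights = render_weights(sigma, delta)
    return float(np.sum(weights * np.asarray(obj_prob, dtype=float)))
```

For example, `render_object_mask([0, 50, 0], [1, 1, 1], [0, 1, 0])` is close to 1 because the only dense sample belongs to the object, whereas `render_object_mask([50, 50], [1, 1], [0, 1])` is close to 0 because the object sample sits behind an opaque non-object sample.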

Paper Structure

This paper contains 23 sections, 14 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: SANeRF-HQ Pipeline. Our method is composed of three parts: a feature container (feature cache or feature field), a mask decoder, and a mask aggregator (object field). It first renders a set of images using a pre-trained NeRF and encodes their SAM features, which are cached or used to optimize a feature field. The SAM decoder takes the feature maps from the cache or the feature field and generates 2D masks based on user prompts. The aggregator fuses 2D masks from different views to produce an object field.
  • Figure 2: Mask Decoder Architecture. The decoder consists of a prompt encoder and an attention decoder. First, the prompts are fed into the prompt encoder. The attention decoder takes the encoded prompts and the feature map from the feature container, and uses attention to produce 2D masks for the given view.
  • Figure 3: Comparison with SA3D and ISRF on the Bonsai scene. SANeRF-HQ can produce accurate segmentation around boundaries.
  • Figure 4: Comparison with SA3D and ISRF on the Garden scene. SANeRF-HQ can preserve the structural details of the table.
  • Figure 5: Ablation Study on the Mask Aggregator. The red points in the RGB images represent the prompts we use in the experiments. By leveraging the 3D geometry, SANeRF-HQ can produce more accurate segmentation (the first two rows). Moreover, our method can maintain consistency, since multi-view information is fused in the object field (the last two rows).
  • ...and 11 more figures