Fast and Efficient: Mask Neural Fields for 3D Scene Segmentation

Zihan Gao, Lingling Li, Licheng Jiao, Fang Liu, Xu Liu, Wenping Ma, Yuwei Guo, Shuyuan Yang

TL;DR

MaskField introduces mask-level distillation for 3D open-vocabulary segmentation with neural fields, addressing the inefficiency and boundary ambiguity of per-pixel CLIP feature distillation. It decouples shape from semantics by learning a mask field with scene-level queries, and it takes object boundaries from SAM, so it produces multi-view-consistent segmentation without storing explicit high-dimensional CLIP features in 3D. The method yields state-of-the-art results on open-vocabulary tasks across NeRF and 3DGS backbones while significantly reducing training time (about 5 minutes) and memory demands, owing to the lower feature dimensionality and boundary-aware supervision. The approach broadens the practical applicability of 3D open-vocabulary segmentation and suggests that mask-based supervision can simplify 3D understanding derived from 2D foundation models.

Abstract

Understanding 3D scenes is a crucial challenge in computer vision research with applications spanning multiple domains. Recent advancements in distilling 2D vision-language foundation models into neural fields, like NeRF and 3DGS, enable open-vocabulary segmentation of 3D scenes from 2D multi-view images without the need for precise 3D annotations. However, while effective, these methods typically rely on the per-pixel distillation of high-dimensional CLIP features, which introduces ambiguity and necessitates complex regularization strategies, adding inefficiency during training. This paper presents MaskField, which enables efficient 3D open-vocabulary segmentation with neural fields from a novel perspective. Unlike previous methods, MaskField decomposes the distillation of mask and semantic features from foundation models by formulating a mask feature field and queries. MaskField overcomes ambiguous object boundaries by naturally introducing SAM-segmented object shapes without extra regularization during training. By circumventing the direct handling of dense high-dimensional CLIP features during training, MaskField is particularly compatible with explicit scene representations like 3DGS. Our extensive experiments show that MaskField not only surpasses prior state-of-the-art methods but also achieves remarkably fast convergence. We hope that MaskField will inspire further exploration into how neural fields can be trained to comprehend 3D scenes from 2D models.
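
To make the "mask feature field and queries" idea concrete, here is a minimal Python sketch of mask-level decoding. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how masks are decoded, so the sigmoid dot-product between per-pixel mask features and queries is assumed here (a MaskFormer-style choice), and every function name, shape, and toy value is hypothetical. The point it shows is that CLIP-space similarity is computed once per query, while each pixel only carries a low-dimensional mask feature.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def decode_masks(pixel_feats, queries):
    """Soft mask value per (pixel, query): sigmoid of their dot product.

    Assumption: a MaskFormer-style decode; the abstract does not specify one.
    """
    return [[1.0 / (1.0 + math.exp(-dot(f, q))) for q in queries]
            for f in pixel_feats]

def label_queries(query_clip, text_clip):
    """Assign each query the text class with the highest cosine similarity."""
    return [max(range(len(text_clip)), key=lambda k: cosine(e, text_clip[k]))
            for e in query_clip]

def segment(pixel_feats, queries, query_clip, text_clip):
    """Per-pixel class = class of the query whose mask responds most strongly."""
    masks = decode_masks(pixel_feats, queries)
    query_labels = label_queries(query_clip, text_clip)
    return [query_labels[max(range(len(queries)), key=lambda n: row[n])]
            for row in masks]

# Toy data (hypothetical): 3 pixels with 2-D mask features, 2 queries,
# 3-D stand-ins for CLIP embeddings, and 2 text classes.
pixel_feats = [[2.0, -1.0], [-1.0, 2.0], [1.5, 0.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]
query_clip = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
text_clip = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

labels = segment(pixel_feats, queries, query_clip, text_clip)
print(labels)  # → [1, 0, 1]
```

In this sketch the high-dimensional comparison in `label_queries` runs over the number of queries rather than the number of pixels, which is one way to read the abstract's claim that dense high-dimensional CLIP features are not handled directly during training.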

Paper Structure

This paper contains 19 sections, 5 equations, 9 figures, and 7 tables.

Figures (9)

  • Figure 1: (a) Extracting pixel-aligned CLIP features from image crops shows ambiguity around object boundaries. (b) PCA visualization of the LERF CLIP feature field. (c) Our method produces clear object boundaries. (d) Previous methods extract dense CLIP features and perform distillation at the pixel level. (e) Our proposed MaskField performs mask-level distillation to avoid handling the pixel-aligned, high-dimensional, ambiguous CLIP features during training.
  • Figure 2: An overview of the proposed MaskField. Given a set of multi-view images, our method distills the open-vocabulary knowledge from CLIP at the mask level. Our method naturally introduces region boundaries segmented by SAM without the need for complex regularization during training.
  • Figure 3: Qualitative comparisons of 2 different scenes in the LERF-Mask dataset. Our method gives the most accurate object segmentation.
  • Figure 4: Qualitative comparisons of 2 different scenes in the 3DOVS dataset. Our method recognizes long-tailed objects and gives the most accurate segmentation maps.
  • Figure 5: Qualitative comparisons of 2 different scenes in the Replica dataset. Our method recognizes objects in complex geometry and gives the most accurate segmentation maps.
  • ...and 4 more figures