Fast and Efficient: Mask Neural Fields for 3D Scene Segmentation
Zihan Gao, Lingling Li, Licheng Jiao, Fang Liu, Xu Liu, Wenping Ma, Yuwei Guo, Shuyuan Yang
TL;DR
MaskField introduces mask-level distillation for 3D open-vocabulary segmentation using neural fields, addressing the inefficiency and boundary ambiguity of per-pixel CLIP feature distillation. By decoupling shape and semantic information into a mask field with scene-level queries and leveraging SAM-provided boundaries, it achieves multi-view-consistent segmentation without explicit high-dimensional 3D CLIP features. The method yields state-of-the-art results on open-vocabulary tasks across NeRF and 3DGS backbones while significantly reducing training time (about 5 minutes) and memory demands, due to reduced feature dimensionality and boundary-aware supervision. This approach broadens practical applicability of 3D open-vocabulary segmentation and suggests that mask-based supervision can simplify 3D understanding derived from 2D foundation models.
Abstract
Understanding 3D scenes is a crucial challenge in computer vision research with applications spanning multiple domains. Recent advancements in distilling 2D vision-language foundation models into neural fields, like NeRF and 3DGS, enable open-vocabulary segmentation of 3D scenes from 2D multi-view images without the need for precise 3D annotations. However, while effective, these methods typically rely on per-pixel distillation of high-dimensional CLIP features, which introduces ambiguity and necessitates complex regularization strategies, making training inefficient. This paper presents MaskField, which enables efficient 3D open-vocabulary segmentation with neural fields from a novel perspective. Unlike previous methods, MaskField decouples the distillation of mask and semantic features from foundation models by formulating a mask feature field and queries. MaskField overcomes ambiguous object boundaries by naturally introducing SAM-segmented object shapes, without extra regularization during training. By circumventing the direct handling of dense high-dimensional CLIP features during training, MaskField is particularly compatible with explicit scene representations like 3DGS. Our extensive experiments show that MaskField not only surpasses prior state-of-the-art methods but also achieves remarkably fast convergence. We hope that MaskField will inspire further exploration into how neural fields can be trained to comprehend 3D scenes from 2D models.
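To make the decoupling of mask and semantic features concrete, the following is a minimal NumPy sketch of mask-level open-vocabulary inference. It is an illustration under stated assumptions, not the MaskField implementation: the function name `segment_with_mask_queries`, the dot-product mask decoding, the `temperature` parameter, and the probability-weighted combination of masks (in the style of mask-classification segmenters) are all hypothetical choices not given in the abstract. The point it shows is that the rendered field only needs to carry low-dimensional mask features, while CLIP-space embeddings are attached to a small set of queries rather than to every pixel.

```python
import numpy as np

def segment_with_mask_queries(pixel_feats, queries, query_sem, text_emb, temperature=0.1):
    """Hypothetical mask-level inference sketch (names and shapes are assumptions).

    pixel_feats: (H, W, D) low-dimensional mask features rendered from the field
    queries:     (N, D)    scene-level mask queries
    query_sem:   (N, C)    one CLIP-space embedding per query
    text_emb:    (K, C)    CLIP text embeddings of the K candidate class prompts
    """
    # Shape branch: each query decodes one soft mask via a dot product.
    mask_logits = pixel_feats @ queries.T            # (H, W, N)
    masks = 1.0 / (1.0 + np.exp(-mask_logits))       # sigmoid, values in (0, 1)

    # Semantic branch: cosine similarity between query and text embeddings.
    q = query_sem / np.linalg.norm(query_sem, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = (q @ t.T) / temperature                    # (N, K)

    # Numerically stable softmax over the K classes for each query.
    sim = sim - sim.max(axis=1, keepdims=True)
    probs = np.exp(sim)
    probs = probs / probs.sum(axis=1, keepdims=True)

    # Combine: per-pixel class score is the mask-weighted sum of query class probabilities.
    scores = masks @ probs                           # (H, W, K)
    return scores.argmax(axis=-1), masks
```

In this sketch, the high-dimensional comparison against text embeddings costs N x K similarities per scene instead of one per pixel, which is one plausible reading of why avoiding dense CLIP features reduces memory and training cost.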
