N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields

Yash Bhalgat, Iro Laina, João F. Henriques, Andrew Zisserman, Andrea Vedaldi

TL;DR

N2F2 introduces Nested Neural Feature Fields, a hierarchical 3D representation in which different feature dimensions of a single feature field encode scene properties at multiple granularities. The method combines scale-aware hierarchical supervision, SAM-derived segments, and CLIP embeddings to distill multi-scale semantic information into a unified 3D Gaussian Splatting representation, paired with a memory-efficient TriPlane+MLP feature field and deferred rendering during training. A composite embedding aggregates scale-specific cues for open-vocabulary querying and yields a single relevancy map per query. The method reports state-of-the-art results in open-vocabulary 3D localization and segmentation, with substantial speedups over prior methods. It performs strongly on challenging compound queries (e.g., "bag of cookies", "lid of the cup") and offers practical benefits for real-time, language-guided 3D scene understanding in robotics and AR contexts.
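To make the querying step concrete, the sketch below aggregates per-scale features into one composite embedding per pixel and scores it against a text embedding to produce a single relevancy map. It is a minimal illustration under stated assumptions, not the paper's formulation: the weighted sum, the uniform default weights, the cosine score, and the names `composite_embedding` and `relevancy_map` are all hypothetical, and the toy vectors stand in for CLIP embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def composite_embedding(scale_features, weights=None):
    """Weighted sum of per-scale feature vectors.

    ASSUMPTION: a plain weighted sum with uniform default weights; the
    paper's actual aggregation rule is not given in this summary.
    """
    if weights is None:
        weights = [1.0 / len(scale_features)] * len(scale_features)
    dim = len(scale_features[0])
    return [
        sum(w * feat[i] for w, feat in zip(weights, scale_features))
        for i in range(dim)
    ]


def relevancy_map(pixel_scale_features, text_embedding, weights=None):
    """One relevancy score per pixel.

    pixel_scale_features[r][c] is a list of per-scale vectors for pixel (r, c).
    """
    return [
        [cosine(composite_embedding(feats, weights), text_embedding) for feats in row]
        for row in pixel_scale_features
    ]
```

Because every pixel yields one composite vector, a query produces one map rather than one map per scale, which is the property the summary highlights.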

Abstract

Understanding complex scenes at multiple levels of abstraction remains a formidable challenge in computer vision. To address this, we introduce Nested Neural Feature Fields (N2F2), a novel approach that employs hierarchical supervision to learn a single feature field, wherein different dimensions within the same high-dimensional feature encode scene properties at varying granularities. Our method allows for a flexible definition of hierarchies, tailored to either the physical dimensions or semantics or both, thereby enabling a comprehensive and nuanced understanding of scenes. We leverage a 2D class-agnostic segmentation model to provide semantically meaningful pixel groupings at arbitrary scales in the image space, and query the CLIP vision-encoder to obtain language-aligned embeddings for each of these segments. Our proposed hierarchical supervision method then assigns different nested dimensions of the feature field to distill the CLIP embeddings using deferred volumetric rendering at varying physical scales, creating a coarse-to-fine representation. Extensive experiments show that our approach outperforms the state-of-the-art feature field distillation methods on tasks such as open-vocabulary 3D segmentation and localization, demonstrating the effectiveness of the learned nested feature field.
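The idea of assigning nested dimensions of one feature vector to different scales can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the scale quantization via `boundaries`, the prefix sizes in `prefix_dims`, the zero-masking of trailing dimensions, and the cosine loss are hypothetical choices, and which scales receive short or long prefixes is also assumed here.

```python
import bisect
import math


def nested_prefix(feature, dim):
    """Keep the first `dim` entries of the feature and zero out the rest."""
    return [x if i < dim else 0.0 for i, x in enumerate(feature)]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def scale_to_level(scale, boundaries):
    """Quantize a physical scale into a level index using sorted boundaries."""
    return bisect.bisect_left(boundaries, scale)


def hierarchical_loss(feature, clip_target, scale, boundaries, prefix_dims):
    """Cosine loss between the scale-selected prefix and the segment's target.

    ASSUMPTION: level k uses the first prefix_dims[k] dimensions, so
    prefix_dims needs len(boundaries) + 1 entries.
    """
    level = scale_to_level(scale, boundaries)
    masked = nested_prefix(feature, prefix_dims[level])
    return 1.0 - cosine_similarity(masked, clip_target)
```

Since each prefix is contained in the next, the supervision at one level also constrains the leading dimensions used at every later level, which is what makes the single field coarse-to-fine rather than a set of independent fields.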

Paper Structure

This paper contains 41 sections, 9 equations, 7 figures, and 11 tables.

Figures (7)

  • Figure 1: Nested Neural Feature Fields (N2F2). We present N2F2, wherein different dimensions of the same feature field encode scene properties at varying granularities. The illustration captures the essence of hierarchical scene understanding, depicting how our model differentiates between coarse and fine scales to accurately interpret complex semantic queries, such as "donuts in a white box" and "chocolate donut", showcasing the model's versatility in handling detailed object descriptions within 3D environments.
  • Figure 2: N2F2 Overview. Left: N2F2 employs 3D Gaussian Splatting (3DGS) to represent the scene, augmented with a feature field that captures scene properties across different scales and semantic granularities. Middle: Different subsets of the same feature vectors encode scene properties at varying scales. This unified feature field is optimized using a hierarchical supervision loss applied to the scale-aware features. Right: We extract a pool of segments using SAM and pre-compute a CLIP embedding for each. Each segment is assigned a physical scale computed using the 3DGS model, which is then used to compute the scale-aware feature.
  • Figure 3: Qualitative comparisons with LangSplat (Qin et al., 2023) on challenging compound queries.
  • Figure 4: Open-vocabulary retrieval performance on the expanded LERF dataset from Qin et al. (2023). $R@K$ (%) reported at $K=1,2,3,4,5$ for each scene.
  • Figure 5: Scene: teatime. Each row contains results for the text query shown on the left. Columns: N2F2 (relevancy and segmentation maps), LangSplat (Qin et al., 2023) (relevancy and segmentation maps).
  • ...and 2 more figures
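
The $R@K$ metric named in the Figure 4 caption can be computed as in the sketch below. It uses the standard definition of recall at K (the percentage of queries whose correct item appears among the top K ranked candidates). The exact evaluation protocol of the retrieval benchmark is not given here, so the data layout and the name `recall_at_k` are assumptions for illustration.

```python
def recall_at_k(ranked_results, ground_truth, k):
    """Percentage of queries whose correct item is in the top-k candidates.

    ranked_results: dict mapping query -> list of candidate ids, best first.
    ground_truth:   dict mapping query -> the single correct candidate id.
    ASSUMPTION: one correct item per query; a query with no ranked results
    counts as a miss.
    """
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one query")
    hits = sum(
        1
        for query, target in ground_truth.items()
        if target in ranked_results.get(query, [])[:k]
    )
    return 100.0 * hits / len(ground_truth)


# Hypothetical toy data, not taken from the paper.
ranked = {"mug": ["a", "b", "c"], "spoon": ["x", "y", "z"]}
truth = {"mug": "a", "spoon": "z"}
scores = {k: recall_at_k(ranked, truth, k) for k in (1, 2, 3)}
```

With this toy data, "mug" is retrieved at rank 1 and "spoon" only at rank 3, so the score rises from 50% at K=1 and K=2 to 100% at K=3.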