OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding

Yinan Deng, Jiahui Wang, Jingyu Zhao, Jianyu Dou, Yi Yang, Yufeng Yue

TL;DR

OpenObj establishes a robust framework for efficient, watertight scene modeling and comprehension at the object level. It also incorporates part-level features into the object NeRF models, so the representation captures object-level instances while preserving an understanding of their internal granularity.

Abstract

In recent years, there has been a surge of interest in open-vocabulary 3D scene reconstruction facilitated by visual language models (VLMs), which showcase remarkable capabilities in open-set retrieval. However, existing methods face some limitations: they either focus on learning point-wise features, resulting in blurry semantic understanding, or solely tackle object-level reconstruction, thereby overlooking the intricate details of the object's interior. To address these challenges, we introduce OpenObj, an innovative approach to building open-vocabulary object-level Neural Radiance Fields (NeRF) with fine-grained understanding. In essence, OpenObj establishes a robust framework for efficient and watertight scene modeling and comprehension at the object level. Moreover, we incorporate part-level features into the neural fields, enabling a nuanced representation of object interiors. This approach captures object-level instances while maintaining a fine-grained understanding. Results on multiple datasets demonstrate that OpenObj achieves superior performance in zero-shot semantic segmentation and retrieval tasks. Additionally, OpenObj supports real-world robotics tasks at multiple scales, including global movement and local manipulation.

Paper Structure (17 sections, 7 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: The framework of OpenObj consists of four main modules: Object Segmentation and Understanding, Mask Clustering, Part-level Fine-Grained Feature Extraction, and Hierarchical Graph Representation Formation.
  • Figure 2: Two-stage mask clustering. In the coarse clustering stage, a graph is constructed over all masks, and the Louvain algorithm is applied to cluster them. In the fine clustering stage, the clusters are further fused according to the matched-point coverage rate and the color similarity of the superimposed point cloud.
  • Figure 3: Part-level fine-grained feature extraction process. The masks $m^{part}_{t,j}$ extracted by SAM are dense and may be nested. The dense masks are visually encoded using VLMs, then averaged and superimposed to produce a feature image $I^{f}_{t}$ that matches the original image size.
  • Figure 4: 2D & 3D zero-shot segmentation results. OpenObj's object-level NeRF and comprehensive understanding enable it to achieve clear boundaries and accurate semantics.
  • Figure 5: A selection of results from open-vocabulary retrieval. OpenObj correctly and clearly highlights the most relevant instance in each query.
  • ...and 1 more figures
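
The Figure 3 caption describes overlapping part masks being encoded, averaged, and superimposed into a feature image the size of the input frame. The sketch below illustrates one plausible reading of that step, in which each pixel receives the mean feature of all masks covering it. It is not the authors' implementation: the function name `build_feature_image`, the nested-list data layout, and the two-dimensional placeholder vectors standing in for VLM embeddings are all assumptions made for illustration.

```python
from typing import List, Sequence

Mask = List[List[bool]]    # H x W boolean mask (hypothetical layout)
Feature = Sequence[float]  # placeholder for one mask's VLM embedding


def build_feature_image(masks: List[Mask], feats: List[Feature],
                        height: int, width: int) -> List[List[List[float]]]:
    """Average the features of all masks covering each pixel.

    Returns an H x W x D nested list; pixels covered by no mask stay all-zero.
    """
    if len(masks) != len(feats):
        raise ValueError("need exactly one feature vector per mask")
    dim = len(feats[0]) if feats else 0
    sums = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    counts = [[0] * width for _ in range(height)]

    # Superimpose: add each mask's feature onto every pixel it covers.
    for mask, feat in zip(masks, feats):
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    counts[y][x] += 1
                    for d in range(dim):
                        sums[y][x][d] += feat[d]

    # Average: divide by the number of (possibly nested) masks at that pixel.
    for y in range(height):
        for x in range(width):
            if counts[y][x] > 1:
                sums[y][x] = [v / counts[y][x] for v in sums[y][x]]
    return sums


# Toy 2x3 image: an outer mask covers every pixel,
# and a nested mask covers only the right-hand column.
outer = [[True, True, True], [True, True, True]]
inner = [[False, False, True], [False, False, True]]
feature_image = build_feature_image(
    [outer, inner], [[1.0, 0.0], [0.0, 1.0]], height=2, width=3
)
```

In this toy case, pixels covered only by the outer mask keep its feature unchanged, while pixels in the nested region hold the mean of the two features. A real pipeline would use SAM masks and high-dimensional VLM embeddings, most likely stored in array form.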