O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation

Muer Tie, Julong Wei, Zhengjun Wang, Ke Wu, Shansuai Yuan, Kaizhao Zhang, Jie Jia, Jieru Zhao, Zhongxue Gan, Wenchao Ding

TL;DR

This work tackles online open-vocabulary scene understanding for robotics by integrating language features into a voxel-based neural implicit mapping framework. It introduces O2V-mapping, a voxel-based open-vocabulary field that supports online updates, object-level segmentation, and CLIP/SAM-based language grounding, augmented by a retrieval map and an LLM-driven interactive agent. Core contributions include a multi-grid O2V field with online depth/color and semantic decoding, an adaptive voxel-splitting and multi-view voting scheme for robust open-set semantics, and an LLM-centered interaction protocol for grounded reasoning. Empirical results show state-of-the-art performance in open-vocabulary localization and segmentation with favorable speed, demonstrating practical potential for real-time robotic perception and planning.

Abstract

Online construction of open-ended language scenes is crucial for robotic applications that require open-vocabulary interactive scene understanding. Recently, neural implicit representations have provided a promising direction for online interactive mapping. However, integrating open-vocabulary scene understanding into online neural implicit mapping still faces three challenges: the lack of local scene updating ability, blurry spatially hierarchical semantic segmentation, and difficulty in maintaining multi-view consistency. To this end, we propose O2V-mapping, which uses voxel-based language and geometric features to create an open-vocabulary field, thus allowing local updates during the online training process. Additionally, we leverage a foundation model for image segmentation to extract language features on object-level entities, achieving clear segmentation boundaries and hierarchical semantic features. To preserve consistency in 3D object properties across viewpoints, we propose a spatially adaptive voxel adjustment mechanism and a multi-view weight selection method. Extensive experiments on open-vocabulary object localization and semantic segmentation demonstrate that O2V-mapping achieves online construction of language scenes while improving accuracy, outperforming the previous SOTA method.
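
To make the "local update" property concrete, here is a minimal sketch of why storing language features per voxel lets a new observation change only the region it touches. It is not the authors' implementation. The class `VoxelFeatureMap`, the 0.2 m voxel size, the running-mean fusion rule, and the 3-dimensional toy features (standing in for CLIP embeddings) are all assumptions made for illustration.

```python
import math
from collections import defaultdict

VOXEL_SIZE = 0.2  # metres; hypothetical value chosen for this sketch


def voxel_key(point, voxel_size=VOXEL_SIZE):
    """Map a 3D point to the integer index of the voxel that contains it."""
    return tuple(math.floor(c / voxel_size) for c in point)


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


class VoxelFeatureMap:
    """Sparse voxel map holding one running-mean feature per voxel (illustrative only)."""

    def __init__(self, voxel_size=VOXEL_SIZE):
        self.voxel_size = voxel_size
        self.features = {}
        self.counts = defaultdict(int)

    def update(self, point, feature):
        """Fuse one observation; only the voxel containing `point` is modified."""
        key = voxel_key(point, self.voxel_size)
        n = self.counts[key]
        old = self.features.get(key, [0.0] * len(feature))
        self.features[key] = [(o * n + f) / (n + 1) for o, f in zip(old, feature)]
        self.counts[key] = n + 1
        return key

    def query(self, text_feature):
        """Return (voxel, cosine similarity) pairs, most similar first."""
        q = normalize(text_feature)
        scores = []
        for key, feat in self.features.items():
            f = normalize(feat)
            scores.append((key, sum(a * b for a, b in zip(q, f))))
        return sorted(scores, key=lambda kv: kv[1], reverse=True)


# Toy usage: two objects, then a second observation of the first one.
vmap = VoxelFeatureMap()
table = vmap.update((0.05, 0.05, 0.05), [1.0, 0.0, 0.0])
chair = vmap.update((1.05, 0.05, 0.05), [0.0, 1.0, 0.0])
vmap.update((0.07, 0.02, 0.01), [0.8, 0.2, 0.0])  # lands in the "table" voxel only
best_voxel, best_score = vmap.query([1.0, 0.0, 0.0])[0]
```

In this toy run the third observation changes only the feature stored in the `table` voxel, while the `chair` voxel is untouched. A representation backed by a single global network has no such guarantee, which is the contrast the abstract draws.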

Paper Structure (18 sections, 11 equations, 7 figures, 1 table)


Figures (7)

  • Figure 1: Online Open-vocabulary Mapping. O2V-mapping allows for online open-set text queries while constructing dense open-set semantic scenes, enabling the spatial localization of queried objects. Compared to LERF, it exhibits clearer object boundaries and more concentrated probability distributions of relevance.
  • Figure 2: Overview of our pipeline. Top: Optimization of the voxel-based neural radiance field. Nearest trilinear interpolation is used to obtain color and geometric features for spatially sampled points. Then, leveraging NeRF's volume rendering, the scene is sampled and average-rendered to produce RGB and depth images. Bottom: Optimization of our O2V field. We employ SAM to segment input RGB images and obtain instances. We then obtain language features through CLIP encoding. Feature indexing is performed to prepare for feature fusion. Finally, voxel splitting and multi-perspective voting are adopted to obtain fine-grained 3D open-vocabulary results.
  • Figure 3: Left: The agent uses a search tree to roll out solutions, and optimizes the solution under the global scene based on feedback from the O2V map. For instance, at the "go to table" node, it queries the key object "table", which returns a high relevance, indicating that this action is feasible. Right: The semantic content carried by the "display" cannot be obtained directly through a query. Therefore, the agent first queries the carrier of the semantic information and renders "display" images. The VLM can then understand that the displayed content is "world maps".
  • Figure 4: Our method is compared with LERF on 7 text query objects in 3 scenes. The relevance probabilities rendered by our method are concentrated around the queried objects, demonstrating clear boundary quality on objects of different shapes and sizes. Refer to Section 4.1 for a discussion and detailed information on relevance visualization.
  • Figure 5: Our method was tested on three scenes and 18 object types from the ScanNet dataset, and it still performed well in real scenes with depth noise.
  • ...and 2 more figures