O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation

Muer Tie; Julong Wei; Zhengjun Wang; Ke Wu; Shansuai Yuan; Kaizhao Zhang; Jie Jia; Jieru Zhao; Zhongxue Gan; Wenchao Ding

TL;DR

This work tackles online open-vocabulary scene understanding for robotics by integrating language features into a voxel-based neural implicit mapping framework. It introduces O2V-mapping, a voxel-based open-vocabulary field that supports online updates, object-level segmentation, and CLIP/SAM-based language grounding, augmented by a retrieval map and an LLM-driven interactive agent. Core contributions include a multi-grid O2V field with online depth/color and semantic decoding, an adaptive voxel-splitting and multi-view voting scheme for robust open-set semantics, and an LLM-centered interaction protocol for grounded reasoning. Empirical results show state-of-the-art performance in open-vocabulary localization and segmentation with favorable speed, demonstrating practical potential for real-time robotic perception and planning.

Abstract

Online construction of open-ended language scenes is crucial for robotic applications, where open-vocabulary interactive scene understanding is required. Recently, neural implicit representation has provided a promising direction for online interactive mapping. However, integrating open-vocabulary scene understanding into online neural implicit mapping still faces three challenges: the lack of local scene updating ability, blurry spatial hierarchical semantic segmentation, and difficulty in maintaining multi-view consistency. To this end, we propose O2V-mapping, which utilizes voxel-based language and geometric features to create an open-vocabulary field, thus allowing for local updates during the online training process. Additionally, we leverage a foundation model for image segmentation to extract language features on object-level entities, achieving clear segmentation boundaries and hierarchical semantic features. To preserve consistency in 3D object properties across different viewpoints, we propose a spatial adaptive voxel adjustment mechanism and a multi-view weight selection method. Extensive experiments on open-vocabulary object localization and semantic segmentation demonstrate that O2V-mapping achieves online construction of language scenes while enhancing accuracy, outperforming the previous SOTA method.
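To make the idea of a voxel-based language field with local updates concrete, the following is a minimal sketch and not the authors' implementation. It assumes a sparse voxel dictionary, a running-mean fusion rule, and toy 3-dimensional feature vectors standing in for CLIP embeddings; `VoxelLanguageMap`, `voxel_key`, `cosine`, and `VOXEL_SIZE` are hypothetical names introduced only for illustration.

```python
import math

VOXEL_SIZE = 0.2  # metres; hypothetical resolution chosen for this sketch


def voxel_key(point, voxel_size=VOXEL_SIZE):
    """Map a 3D point to the integer index of the voxel that contains it."""
    return tuple(math.floor(c / voxel_size) for c in point)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class VoxelLanguageMap:
    """Sparse voxel grid storing one running-mean feature per voxel.

    The running mean is an assumption of this sketch; the paper's fusion
    relies on voxel splitting and multi-view voting instead.
    """

    def __init__(self):
        self.features = {}  # voxel index -> feature vector (list of floats)
        self.counts = {}    # voxel index -> number of fused observations

    def update(self, point, feature):
        """Fuse one observation; only the voxel containing `point` changes."""
        key = voxel_key(point)
        n = self.counts.get(key, 0)
        old = self.features.get(key, [0.0] * len(feature))
        self.features[key] = [(o * n + f) / (n + 1) for o, f in zip(old, feature)]
        self.counts[key] = n + 1
        return key

    def query(self, text_feature):
        """Return (voxel index, cosine similarity) pairs, best match first."""
        scores = [(k, cosine(v, text_feature)) for k, v in self.features.items()]
        return sorted(scores, key=lambda kv: kv[1], reverse=True)


# Toy usage: two "objects" with orthogonal stand-in embeddings.
m = VoxelLanguageMap()
table = m.update((0.05, 0.05, 0.05), [1.0, 0.0, 0.0])
chair = m.update((1.05, 0.05, 0.05), [0.0, 1.0, 0.0])
m.update((0.06, 0.04, 0.05), [0.8, 0.2, 0.0])  # second view of the table voxel
best_voxel, best_score = m.query([1.0, 0.0, 0.0])[0]
```

In this toy run the second observation modifies only the table voxel and leaves the chair voxel untouched, which is the local-update property that a voxel-based representation provides and a single global network does not.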

Paper Structure (18 sections, 11 equations, 7 figures, 1 table)

Introduction
Related Work
Methods
Online Construction of Open-Vocabulary Field
Voxel Implicit Representation with Multiple Categories.
Object-Level Open-Vocabulary Mapping.
Language Feature Fusion
Voxel Splitting.
Multi-view Voting.
Querying O2V Field
Establishment of Retrieval Map.
Interacting with Large Language Model (LLM).
Results
Qualitative Results
Query Robustness
...and 3 more sections

Figures (7)

Figure 1: Online Open-vocabulary Mapping. O2V-mapping allows for online open-set text queries while constructing dense open-set semantic scenes, enabling the spatial localization of queried objects. Compared to LERF, it exhibits clearer object boundaries and more concentrated probability distributions of relevance.
Figure 2: The overview of our pipeline. Top: Optimization of voxel-based neural radiance fields. Nearest trilinear interpolation is used to obtain color and geometric features for spatially sampled points. Then, leveraging NeRF's volume rendering, the scene is sampled and average-rendered to produce RGB and depth images. Bottom: Optimization of our O2V field. We employ SAM to segment input RGB images and obtain instances. We further obtain language features through CLIP encoding. Feature indexing is performed to prepare for feature fusion. Finally, voxel splitting and multi-perspective voting are adopted to obtain fine-grained 3D open-vocabulary results.
Figure 3: Left: The agent uses a search tree to roll out solutions, and optimizes the solution under the global scene based on feedback from the O2V map. For instance, at the "go to table" node, it queries the key object "table", which returns a high relevance, indicating that this action is feasible. Right: The semantic content carried by the "display" cannot be obtained directly through a query. Therefore, the agent first queries the carrier of the semantic information and renders "display" images. The VLM can then recognize that the displayed content is "world maps".
Figure 4: Our method is compared with LERF on 7 text query objects in 3 scenes. The relevance probabilities rendered by our method are concentrated around the queried objects, demonstrating clear boundary quality on objects of different shapes and sizes. Refer to Section 4.1 for a discussion and detailed information on relevance visualization.
Figure 5: Our method was tested on three scenes and 18 object types from the ScanNet dataset, and it still performed well in real scenes with depth noise.
...and 2 more figures
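
The volume rendering step mentioned in the Figure 2 caption can be illustrated with the standard NeRF quadrature, in which each sample along a ray contributes with weight w_i = T_i (1 - exp(-sigma_i * delta_i)) and T_i is the transmittance accumulated before sample i. The sketch below assumes this generic formulation; the paper's own decoder and density-to-opacity conversion may differ, and `render_ray` is a hypothetical helper name.

```python
import math


def render_ray(depths, densities, colors):
    """Composite colour and depth along one ray (generic NeRF-style quadrature).

    depths:    increasing sample distances along the ray
    densities: non-negative volume density at each sample
    colors:    (r, g, b) tuple at each sample
    Returns (rgb, depth, accumulated_weight).
    """
    transmittance = 1.0
    rgb = [0.0, 0.0, 0.0]
    depth = 0.0
    weight_sum = 0.0
    for i, (t, sigma, c) in enumerate(zip(depths, densities, colors)):
        # Interval to the next sample; the last sample reuses the previous
        # interval (an assumption of this sketch).
        if i + 1 < len(depths):
            delta = depths[i + 1] - t
        elif i > 0:
            delta = t - depths[i - 1]
        else:
            delta = 1.0
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this interval
        w = transmittance * alpha               # contribution of this sample
        rgb = [acc + w * ch for acc, ch in zip(rgb, c)]
        depth += w * t
        weight_sum += w
        transmittance *= 1.0 - alpha            # light remaining after sample
    return rgb, depth, weight_sum


# Toy ray: empty space, then a dense green surface at distance 2.0.
rgb, depth, weight_sum = render_ray(
    depths=[1.0, 2.0, 3.0],
    densities=[0.0, 50.0, 0.0],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
)
```

Because the weights concentrate on the dense sample, the rendered depth lands near 2.0 and the colour is close to pure green. Rendered RGB and depth of this kind are what get compared against the input frames when optimizing the voxel features online.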
