O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation
Muer Tie, Julong Wei, Zhengjun Wang, Ke Wu, Shansuai Yuan, Kaizhao Zhang, Jie Jia, Jieru Zhao, Zhongxue Gan, Wenchao Ding
TL;DR
This work tackles online open-vocabulary scene understanding for robotics by integrating language features into a voxel-based neural implicit mapping framework. It introduces O2V-mapping, a voxel-based open-vocabulary field that supports online updates, object-level segmentation, and CLIP/SAM-based language grounding, augmented by a retrieval map and an LLM-driven interactive agent. Core contributions include a multi-grid O2V field with online depth/color and semantic decoding, an adaptive voxel-splitting and multi-view voting scheme for robust open-set semantics, and an LLM-centered interaction protocol for grounded reasoning. Empirical results show state-of-the-art performance in open-vocabulary localization and segmentation with favorable speed, demonstrating practical potential for real-time robotic perception and planning.
Abstract
Online construction of open-ended language scenes is crucial for robotic applications that require interactive, open-vocabulary scene understanding. Neural implicit representations have recently emerged as a promising direction for online interactive mapping. However, integrating open-vocabulary scene understanding into online neural implicit mapping still faces three challenges: the inability to update the scene locally, blurry spatially hierarchical semantic segmentation, and difficulty maintaining multi-view consistency. To address these, we propose O2V-mapping, which uses voxel-based language and geometric features to create an open-vocabulary field, allowing local updates during online training. Additionally, we leverage a foundation model for image segmentation to extract language features for object-level entities, achieving clear segmentation boundaries and hierarchical semantic features. To keep 3D object properties consistent across viewpoints, we propose a spatially adaptive voxel adjustment mechanism and a multi-view weight selection method. Extensive experiments on open-vocabulary object localization and semantic segmentation demonstrate that O2V-mapping constructs language scenes online while improving accuracy, outperforming the previous SOTA method.
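To make the idea of a locally updatable, voxel-based language field concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a sparse dictionary of voxels, a weighted running average as a stand-in for multi-view fusion, and cosine similarity for text queries. The names `VoxelLanguageMap`, `voxel_key`, and `VOXEL_SIZE` are hypothetical, and the short feature vectors stand in for real language embeddings.

```python
import math

VOXEL_SIZE = 0.1  # hypothetical voxel edge length, in meters


def voxel_key(point, voxel_size=VOXEL_SIZE):
    """Map a 3D point to the integer index of the voxel containing it."""
    return tuple(math.floor(c / voxel_size) for c in point)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VoxelLanguageMap:
    """Sparse voxel map storing one fused language feature per voxel."""

    def __init__(self, dim):
        self.dim = dim
        self.features = {}  # voxel index -> fused feature vector
        self.weights = {}   # voxel index -> accumulated observation weight

    def update(self, point, feature, weight=1.0):
        """Fuse one observation into the voxel containing `point`.

        Only that voxel is modified, which is what makes the update local.
        The weight is an assumed per-view confidence score.
        """
        key = voxel_key(point)
        old_w = self.weights.get(key, 0.0)
        old_f = self.features.get(key, [0.0] * self.dim)
        new_w = old_w + weight
        self.features[key] = [
            (old_w * o + weight * f) / new_w for o, f in zip(old_f, feature)
        ]
        self.weights[key] = new_w
        return key

    def query(self, text_feature):
        """Return the voxel index whose feature best matches the text feature."""
        return max(
            self.features,
            key=lambda k: cosine(self.features[k], text_feature),
        )


if __name__ == "__main__":
    scene = VoxelLanguageMap(dim=2)
    scene.update((0.05, 0.05, 0.05), [1.0, 0.0], weight=1.0)
    scene.update((0.05, 0.05, 0.05), [0.0, 1.0], weight=3.0)  # more trusted view
    scene.update((1.05, 0.0, 0.0), [1.0, 0.0])
    print(scene.query([1.0, 0.0]))
```

In this sketch, a new observation touches a single dictionary entry, so earlier parts of the map are left unchanged. A global implicit network, by contrast, shares its parameters across the whole scene.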
