Online Embedding Multi-Scale CLIP Features into 3D Maps
Shun Taguchi, Hideki Deguchi
TL;DR
The paper tackles open-vocabulary semantic mapping for 3D maps to support language-based queries in unknown environments. It introduces CLIPMapper, an online system that embeds multi-scale CLIP features into 3D maps: it cuts each RGB image into patches at multiple scales, encodes all patches as one batch in a single forward pass, and back-projects depth to world coordinates to place the resulting features in the map. The resulting map supports offline retrieval via text queries and a zero-shot object-goal navigation pipeline that uses CLIP-based localization to guide exploration with a multi-goal planner. In simulation (Habitat) and real-robot experiments (Vizbot), the method maps faster than state-of-the-art mapping methods and achieves higher object-goal navigation success rates, including for objects outside standard COCO vocabularies. It also demonstrates open-vocabulary object retrieval and multi-object navigation.
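The patching and back-projection steps summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the non-overlapping grid layout, the grid sizes, and the pinhole camera model with intrinsics `fx`, `fy`, `cx`, `cy` are all assumptions made for illustration, and the CLIP forward pass itself is omitted.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
Point = Tuple[float, float, float]


def multi_scale_patches(width: int, height: int, grids: List[int]) -> List[Box]:
    """Return patch boxes for every g x g grid in `grids` (assumed layout).

    All boxes are collected into one list so that the corresponding crops
    could be resized and encoded together in a single batched forward pass.
    """
    boxes: List[Box] = []
    for g in grids:
        for row in range(g):
            for col in range(g):
                left = col * width // g
                top = row * height // g
                right = (col + 1) * width // g
                bottom = (row + 1) * height // g
                boxes.append((left, top, right, bottom))
    return boxes


def back_project(u: float, v: float, depth: float,
                 fx: float, fy: float, cx: float, cy: float) -> Point:
    """Pinhole back-projection of pixel (u, v) with metric depth (camera frame)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def to_world(point: Point,
             rotation: Sequence[Sequence[float]],
             translation: Sequence[float]) -> Point:
    """Apply a camera-to-world pose: p_world = R * p_cam + t."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Hypothetical usage: a 640x480 image split at three assumed scales.
boxes = multi_scale_patches(640, 480, [1, 2, 4])
left, top, right, bottom = boxes[1]
center_cam = back_project((left + right) / 2, (top + bottom) / 2, 2.0,
                          500.0, 500.0, 320.0, 240.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
center_world = to_world(center_cam, identity, [0.0, 0.0, 0.0])
```

In this sketch each patch feature would be stored at the world position of its patch center. How the actual system associates features with 3D points may differ.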
Abstract
This study introduces a novel approach to online embedding of multi-scale CLIP (Contrastive Language-Image Pre-Training) features into 3D maps. By harnessing CLIP, this methodology surpasses the constraints of conventional vocabulary-limited methods and enables the incorporation of semantic information into the resultant maps. While recent approaches have explored the embedding of multi-modal features in maps, they often impose significant computational costs, which makes them impractical for exploring unfamiliar environments in real time. Our approach tackles these challenges by efficiently computing and embedding multi-scale CLIP features, thereby facilitating the exploration of unfamiliar environments through real-time map generation. Moreover, embedding CLIP features into the resultant maps makes offline retrieval via linguistic queries feasible. In essence, our approach simultaneously achieves real-time object search and mapping of unfamiliar environments. Additionally, we propose a zero-shot object-goal navigation system based on our mapping approach, and we validate its efficacy through object-goal navigation, offline object retrieval, and multi-object-goal navigation in both simulated environments and real robot experiments. The findings demonstrate that our method is not only faster than state-of-the-art mapping methods but also surpasses them in the success rate of object-goal navigation tasks.
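Offline retrieval via linguistic queries can be pictured as a nearest-neighbor search between a text embedding and the features stored in the map. The sketch below assumes cosine similarity as the score and represents the map as a list of `(position, feature)` pairs. The names `cosine`, `retrieve`, and `map_entries` are hypothetical, and the toy 2D vectors stand in for real CLIP embeddings.

```python
import math
from typing import List, Sequence, Tuple

Position = Tuple[float, float, float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(map_entries: List[Tuple[Position, Sequence[float]]],
             query_embedding: Sequence[float],
             top_k: int = 1) -> List[Tuple[float, Position]]:
    """Rank stored map positions by similarity to the query embedding."""
    scored = [(cosine(feature, query_embedding), position)
              for position, feature in map_entries]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Toy map: two positions with stand-in feature vectors.
map_entries = [
    ((0.0, 0.0, 1.0), [1.0, 0.0]),
    ((2.0, 0.0, 1.0), [0.0, 1.0]),
]
# Stand-in for the embedding of a text query.
best_score, best_position = retrieve(map_entries, [0.1, 0.9])[0]
```

Because the features stay in the map after exploration, new text queries can be answered without re-running the image encoder. Only the query text would need to be embedded.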
