OMCL: Open-vocabulary Monte Carlo Localization
Evgenii Kruzhkov, Raphael Memmesheimer, Sven Behnke
TL;DR
<p>OMCL addresses robust localization in maps built from diverse sensors by grounding pose estimation in open-vocabulary vision-language features rather than fixed semantic categories. It builds an Octree Language Map that stores CLIP-like features and uses ray tracing to evaluate the consistency between observations and the map within a Monte Carlo Localization framework. This enables cross-modal mapping and localization, and lets open-set prompts initialize global localization. The main contributions are two mapping options for constructing the language map, a cosine-similarity-based measurement model with stratified ray sampling, and a prompt-augmented initialization method that accelerates global localization. Evaluation on Matterport3D, Replica, and SemanticKITTI shows that the approach generalizes across indoor and outdoor scenes, localizes more accurately than semantic baselines, and benefits from prompt-guided initialization and the sampling strategy.</p>
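The cosine-similarity measurement model and stratified sampling mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the Gaussian-shaped mapping from similarity to log-likelihood, the `sigma` parameter, and the grid-based pixel stratification are all assumptions made for illustration. The sketch assumes that, for each particle, the expected features along the sampled rays have already been read from the map by ray tracing.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    """Row-wise cosine similarity between two (N, D) feature arrays."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den

def particle_log_likelihood(observed, expected, sigma=0.5):
    """Hypothetical scoring rule: the mean cosine similarity over all rays
    is mapped to a log-likelihood that peaks when the features agree."""
    mean_sim = cosine_similarity(observed, expected).mean()
    return -((1.0 - mean_sim) ** 2) / (2.0 * sigma ** 2)

def update_weights(weights, log_likelihoods):
    """Standard MCL weight update, done in log space for numerical stability."""
    log_w = np.log(weights + 1e-300) + log_likelihoods
    log_w -= log_w.max()
    w = np.exp(log_w)
    return w / w.sum()

def stratified_pixel_sample(height, width, rows, cols, rng):
    """Assumed stratification: one random pixel per cell of a rows x cols
    grid, so the sampled rays cover the whole image (needs height >= rows
    and width >= cols)."""
    r_edges = np.linspace(0, height, rows + 1).astype(int)
    c_edges = np.linspace(0, width, cols + 1).astype(int)
    pixels = []
    for i in range(rows):
        for j in range(cols):
            r = rng.integers(r_edges[i], r_edges[i + 1])
            c = rng.integers(c_edges[j], c_edges[j + 1])
            pixels.append((r, c))
    return np.array(pixels)
```

In this sketch, a particle whose ray-traced map features match the observed features keeps most of its weight, while a particle whose expected features point in the opposite direction is suppressed.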
Abstract
Robust robot localization is an important prerequisite for navigation planning. If the environment map was created with different sensors than those used for localization, robot measurements must be robustly associated with map features. In this work, we extend Monte Carlo Localization using vision-language features. These open-vocabulary features make it possible to robustly compute the likelihood of visual observations, given a camera pose and a 3D map created from posed RGB-D images or aligned point clouds. Because the vision-language features are abstract, they allow observations and map elements from different modalities to be associated. Global localization can be initialized with natural language descriptions of the objects present in the vicinity of a location. We evaluate our approach on Matterport3D and Replica for indoor scenes and demonstrate generalization to outdoor scenes on SemanticKITTI.
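As a rough illustration of language-based initialization, the following Python sketch seeds particles near map elements whose stored features are most similar to a text embedding. Everything here is an assumption for illustration: the function name `init_particles_from_prompt`, the top-k selection, the Gaussian position noise, and the planar pose `(x, y, yaw)` are hypothetical, and `text_emb` is assumed to come from the same CLIP-like model that produced the map features.

```python
import numpy as np

def init_particles_from_prompt(centers, features, text_emb, n_particles,
                               rng, top_k=50, pos_sigma=0.5):
    """Sample (x, y, yaw) particles around the map elements that best
    match a text prompt.

    centers:  (N, 3) positions of map elements (e.g. voxel centers)
    features: (N, D) vision-language features stored in the map
    text_emb: (D,)   embedding of the prompt (assumed precomputed)
    """
    # Cosine similarity between every map feature and the prompt embedding.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    t = text_emb / (np.linalg.norm(text_emb) + 1e-12)
    sim = f @ t

    # Keep the k best-matching map elements (assumed selection rule).
    k = min(top_k, len(sim))
    best = np.argsort(sim)[-k:]

    # Scatter particles around the selected elements with random headings.
    chosen = rng.choice(best, size=n_particles, replace=True)
    xy = centers[chosen, :2] + rng.normal(0.0, pos_sigma, size=(n_particles, 2))
    yaw = rng.uniform(-np.pi, np.pi, size=n_particles)
    return np.column_stack([xy, yaw])
```

Compared with spreading particles uniformly over the whole map, concentrating them near prompt-matching regions reduces the area the filter has to search, which is the intuition behind using object descriptions for global initialization.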
