One Map to Find Them All: Real-time Open-Vocabulary Mapping for Zero-shot Multi-Object Navigation

Finn Lukas Busch, Timon Homberger, Jesús Ortega-Peimbert, Quantao Yang, Olov Andersson

TL;DR

This work tackles real-time open-vocabulary multi-object navigation by building a reusable semantic belief map (OneMap) that accumulates CLIP-aligned patch features together with an estimate of their uncertainty. It combines dense, patch-level semantic extraction via the SED/CLIP pipeline with a probabilistic 2D map update that accounts for depth-derived uncertainty and feature leakage, and uses a Kalman-based fusion to produce a queryable map. Navigation is driven by a frontier-based exploration strategy that uses four semantic sub-maps and a CLIP-based similarity field to select informative frontiers and clusters, with consensus filtering from an object detector to confirm detections. The approach achieves state-of-the-art or competitive results on HM3D single- and multi-object zero-shot tasks and runs onboard a real robot on a Jetson Orin AGX, which makes it practical for mobile robots that need flexible, memory-enabled search for arbitrary objects.
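The Kalman-based fusion and map query mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single scalar variance per map cell, a static-state Kalman update, and plain cosine similarity for the query. The names `fuse_cell` and `cosine_similarity` are hypothetical, and the short vectors stand in for CLIP-aligned features.

```python
import math


def fuse_cell(mean, var, obs, obs_var):
    """Fuse one observed feature vector into a cell's belief.

    Assumption: each cell stores a feature mean and one scalar variance.
    """
    if var is None:  # first observation initialises the cell
        return list(obs), obs_var
    gain = var / (var + obs_var)  # Kalman gain for a static state
    new_mean = [m + gain * (o - m) for m, o in zip(mean, obs)]
    new_var = (1.0 - gain) * var  # variance shrinks with every observation
    return new_mean, new_var


def cosine_similarity(a, b):
    """Cosine similarity between a cell feature and a query embedding."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


# Two equally uncertain observations of the same cell.
mean, var = fuse_cell(None, None, [1.0, 0.0], 0.5)
mean, var = fuse_cell(mean, var, [0.0, 1.0], 0.5)
score = cosine_similarity(mean, [0.0, 1.0])  # toy stand-in for a text query
```

With equal observation variances the fused mean is the average of the two observations and the variance halves, so observations with low uncertainty dominate the stored feature over time.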

Abstract

The capability to efficiently search for objects in complex environments is fundamental for many real-world robot applications. Recent advances in open-vocabulary vision models have resulted in semantically informed object navigation methods that allow a robot to search for an arbitrary object without prior training. However, these zero-shot methods have so far treated the environment as unknown for each consecutive query. In this paper we introduce a new benchmark for zero-shot multi-object navigation, allowing the robot to leverage information gathered from previous searches to find new objects more efficiently. To address this problem we build a reusable open-vocabulary feature map tailored for real-time object search. We further propose a probabilistic-semantic map update that mitigates common sources of error in semantic feature extraction, and we leverage this semantic uncertainty for informed multi-object exploration. We evaluate our method on a set of object navigation tasks both in simulation and on a real robot, running in real time on a Jetson Orin AGX. We demonstrate that it outperforms existing state-of-the-art approaches on both single- and multi-object navigation tasks. Additional videos, code and the multi-object navigation benchmark will be available at https://finnbsch.github.io/OneMap.

Paper Structure

This paper contains 16 sections, 6 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Given a sequence of target object queries, e.g. ("chair", "toilet", "bed"), OneMap enables an agent to perform efficient open-vocabulary multi-object navigation.
  • Figure 2: Overview of the map construction. We obtain a dense feature field with corresponding variances (a) from RGB images using the SED encoder (Xie et al., 2024), and project the features to 3D (b), and then to the 2D map space (c). We apply a sparse inverse Gaussian convolution (d) to account for uncertainties in the feature locations. Lastly, we update the open-vocabulary belief map (e) with the mapped features and obtain a dense, open-vocabulary queryable belief map.
  • Figure 3: The observed ($\mathcal{O}$), fully explored ($\mathcal{E}$), and non-navigable ($\sim \mathcal{N}$) map with an agent venturing into unknown areas. The blue line $\delta$ marks the frontier between $\mathcal{E}$ and $\mathcal{O}$.
  • Figure 4: Comparison of our map (left) vs VLFM's map (right) taken from an episode of the HM3D dataset as the agent enters a new room, with the corresponding RGB observation. VLFM's similarity map built for the target object "couch" yields less spatially accurate similarity scores than our feature map queried for the same object.
  • Figure 5: SPL for the first, second, and third object goal. Our method effectively uses the information stored in the map, as evident in improved performance for later object goals.
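
The frontier described in Figure 3 can be sketched on a grid as follows. This is an illustrative sketch, not the paper's implementation: it assumes the maps are stored as sets of `(row, col)` cells, that the fully explored set is contained in the observed set, and that a frontier cell is a navigable, fully explored cell with a 4-connected neighbour that has been observed but not yet fully explored. The function name `frontier_cells` is hypothetical.

```python
def frontier_cells(explored, observed, non_navigable=frozenset()):
    """Return explored, navigable cells that touch an observed-but-unexplored cell.

    Assumption: all arguments are sets of (row, col) grid coordinates.
    """
    pending = observed - explored  # seen, but not yet fully explored
    frontier = set()
    for (r, c) in explored - non_navigable:
        neighbours = ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1))
        if any(n in pending for n in neighbours):
            frontier.add((r, c))
    return frontier


# A 1x3 corridor: two explored cells, a third only observed from a distance.
explored = {(0, 0), (0, 1)}
observed = {(0, 0), (0, 1), (0, 2)}
frontier = frontier_cells(explored, observed)
blocked = frontier_cells(explored, observed, non_navigable={(0, 1)})
```

In this toy corridor only the cell next to the unexplored region is a frontier, and marking it non-navigable removes it. A navigation policy would then rank the remaining frontier cells, for example by the similarity scores stored in the map.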