osmAG-LLM: Zero-Shot Open-Vocabulary Object Navigation via Semantic Maps and Large Language Models Reasoning

Fujing Xie, Sören Schwertfeger, Hermann Blum

TL;DR

This work develops a mapping and navigation system for object-goal navigation that accounts for the possibility that a queried object has been moved or was never mapped at all. It by far outperforms prior approaches on dynamic and unmapped object queries.

Abstract

Recent open-vocabulary robot mapping methods enrich dense geometric maps with pre-trained visual-language features, achieving a high level of detail and guiding robots to find objects specified by open-vocabulary language queries. While the scalability of such approaches has received some attention, another fundamental problem is that high-detail object maps quickly become outdated, because objects are moved frequently. In this work, we develop a mapping and navigation system for object-goal navigation that, from the ground up, accounts for the possibility that a queried object has been moved or was never mapped at all. Instead of striving for high-fidelity mapping detail, we consider the main purpose of a map to be providing environment grounding and context. We combine this grounding with the semantic priors of LLMs to reason about object locations, and deploy an active, online approach to navigate to the objects. Through simulated and real-world experiments, we find that our approach tends to achieve higher retrieval success at shorter path lengths for static objects and by far outperforms prior approaches on dynamic or unmapped object queries. We provide our code and dataset at: https://github.com/xiexiexiaoxiexie/osmAG-LLM.

Paper Structure

This paper contains 26 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: The semantic-osmAG employed in our method is a hierarchical, topometric map representation enhanced with textual semantic objects (attached to cyan nodes) and room attributes (attached to rooms). By leveraging this map with LLMs, the robot achieves efficient navigation and object localization, even for objects that were absent during the initial mapping phase (unmapped objects).
  • Figure 2: An overview of our method. We construct a semantic-osmAG offline by augmenting the basic osmAG with two additional keys: object-nodes (extracted via LabelMaker from RGB-D trajectory data) and viewpoint-nodes (processed by a VLM and placed along the trajectory path). When given a human query, the system uses an LLM to generate proposed geometric nodes (response nodes) based on the pre-built textual semantic-osmAG and the query. The robot then navigates to these nodes one by one using ROS move_base. Once at a response node, an open-vocabulary object detector proposes bounding boxes for the queried object, which are then checked by a VLM to verify if the object is actually present. If the object isn't found, the robot turns to capture additional perspectives. If the object still isn't detected after checking all views at that node, the robot moves to the next response node and repeats the detection process. (Cartoon robot in Online Detection module generated by ChatGPT-4V; prompt: 'robot rotating in place, taking pictures'.)
  • Figure 3: (a) shows a pre-built navigation graph from Werby et al. (2024) for reference, while (b)-(d) show our navigation process. With the query 'couch in the living room': (b) initial setup with an osmAG-rendered occupancy map (walls/doors only), where red rectangles mark ground truth and yellow rectangles show response nodes (green circles indicate their sequence). (c)-(d) The robot progressively perceives the environment, visiting nodes 1 through 3 and replanning upon collisions. Our navigation strategy produces more direct paths than the pre-built navigation graph in (a) and, through progressive environmental perception, achieves long-term operation with minimal map updates.
  • Figure 4: Real-world experimental environment consisting of: a conference room (upper left), a student office (upper right), a professor's office (lower left), a robotics lab (lower middle), and a relaxation lounge (lower right). (a) Data collected using an Apple scanner with the "3D Scanner App"; (b) the semantic-osmAG map; (c) the HOV-SG generated from the collected data.
  • Figure 5: Experimental objects used for real-world evaluation: The first row shows static objects, the second row contains relocated objects, and the final row displays unmapped objects absent from the map. Successful detections are shown with automatically generated bounding boxes during experiments, while text overlays indicate object names and instance counts in our environment (blue = success, red = failure).
  • ...and 2 more figures
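
The online search loop described in the Figure 2 caption (visit the LLM-proposed response nodes in order, rotate to capture several views at each node, let an open-vocabulary detector propose boxes, and accept a box only if the VLM confirms it) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `ResponseNode`, `SearchResult`, `search_for_object`, the four callbacks, and the `views_per_node` default are all hypothetical names and values. In a real system the callbacks would wrap ROS move_base, the camera, the detector, and the VLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical pixel box: (x_min, y_min, x_max, y_max).
BBox = Tuple[int, int, int, int]


@dataclass(frozen=True)
class ResponseNode:
    """Hypothetical stand-in for a geometric node proposed by the LLM."""
    name: str
    x: float
    y: float


@dataclass
class SearchResult:
    found: bool
    node: Optional[ResponseNode] = None
    view_index: Optional[int] = None
    bbox: Optional[BBox] = None
    nodes_visited: int = 0


def search_for_object(
    query: str,
    response_nodes: Sequence[ResponseNode],
    navigate_to: Callable[[ResponseNode], bool],
    capture_view: Callable[[int], object],
    detect: Callable[[object, str], List[BBox]],
    verify: Callable[[object, BBox, str], bool],
    views_per_node: int = 4,  # assumed value; the caption gives no number
) -> SearchResult:
    """Visit response nodes in order until a verified detection is found."""
    visited = 0
    for node in response_nodes:
        # Assumption: an unreachable node is skipped rather than retried.
        if not navigate_to(node):
            continue
        visited += 1
        # Turn in place to look at the node from several perspectives.
        for view_index in range(views_per_node):
            image = capture_view(view_index)
            # The detector proposes candidate boxes; the VLM confirms or rejects each.
            for bbox in detect(image, query):
                if verify(image, bbox, query):
                    return SearchResult(True, node, view_index, bbox, visited)
    # All proposed nodes and views exhausted without a verified detection.
    return SearchResult(False, nodes_visited=visited)
```

Passing the navigation, perception, and verification steps in as callbacks keeps the loop testable with simple stubs. It also makes the control flow explicit: a detection only ends the search after verification, so a detector false positive costs one extra check rather than a failed retrieval.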