VoroNav: Voronoi-based Zero-shot Object Navigation with Large Language Model

Pengying Wu, Yao Mu, Bingxian Wu, Yi Hou, Ji Ma, Shanghang Zhang, Chang Liu

TL;DR

VoroNav tackles zero-shot object navigation by marrying a Reduced Voronoi Graph–based topological map with multimodal scene descriptions and LLM-guided reasoning. The framework builds a semantic RVG from real-time maps, generates path and farsight textual descriptions, and leverages GPT-3.5 to evaluate candidate waypoints, balancing exploration, efficiency, and commonsense cues. Empirical results on HM3D and HSSD show state-of-the-art improvements in success rate and path-length efficiency, with ablation and planning studies underscoring the value of combining path and farsight information. The approach demonstrates how structured topological planning and language-model reasoning can yield safer, more efficient zero-shot navigation in complex indoor environments.
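
The RVG construction itself is not spelled out in this summary. As a rough illustration of the underlying idea only, the sketch below approximates a generalized Voronoi diagram on a small occupancy grid with a brushfire (multi-source breadth-first) pass: every free cell inherits the label of its nearest obstacle blob, and cells where two labels meet form the Voronoi skeleton. The grid encoding (`#` for obstacle, `.` for free), the function name `voronoi_cells`, and the use of 4-connected distance are assumptions made for this example; the pruning that turns a Voronoi diagram into a reduced graph is not shown.

```python
from collections import deque

STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-connected neighbours


def voronoi_cells(grid):
    """Return free cells lying on the boundary between the regions of two
    different obstacle blobs (a grid approximation of a Voronoi skeleton)."""
    rows, cols = len(grid), len(grid[0])
    label = [[-1] * cols for _ in range(rows)]
    queue = deque()
    next_label = 0

    # 1. Give every connected obstacle blob its own integer label.
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == "#" and label[r][c] == -1:
                label[r][c] = next_label
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    queue.append((y, x))  # obstacle cells seed the brushfire
                    for dy, dx in STEPS:
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == "#" and label[ny][nx] == -1):
                            label[ny][nx] = next_label
                            stack.append((ny, nx))
                next_label += 1

    # 2. Brushfire: spread labels outward so each free cell takes the label
    #    of a nearest obstacle blob (ties go to the blob queued first).
    while queue:
        y, x = queue.popleft()
        for dy, dx in STEPS:
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols and label[ny][nx] == -1:
                label[ny][nx] = label[y][x]
                queue.append((ny, nx))

    # 3. A free cell is on the skeleton if a free neighbour carries a larger
    #    label; the one-sided test keeps the skeleton one cell thick.
    cells = set()
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != ".":
                continue
            for dy, dx in STEPS:
                ny, nx = r + dy, c + dx
                if (0 <= ny < rows and 0 <= nx < cols
                        and grid[ny][nx] == "." and label[r][c] < label[ny][nx]):
                    cells.add((r, c))
                    break
    return cells


corridor = ["#####",
            ".....",
            ".....",
            ".....",
            "#####"]
print(sorted(voronoi_cells(corridor)))  # the middle row of the corridor
```

For a straight corridor the skeleton is the centre line, which is the property that makes Voronoi paths attractive for navigation: they stay as far from obstacles as the free space allows.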

Abstract

In household robotics, the Zero-Shot Object Navigation (ZSON) task requires agents to traverse unfamiliar environments and locate objects from novel categories without prior explicit training. This paper introduces VoroNav, a novel semantic exploration framework that proposes the Reduced Voronoi Graph (RVG) to extract exploratory paths and planning nodes from a semantic map constructed in real time. By harnessing topological and semantic information, VoroNav designs text-based descriptions of paths and images that are readily interpretable by a large language model (LLM). In particular, our approach combines path and farsight descriptions to represent the environmental context, enabling the LLM to apply commonsense reasoning when selecting waypoints for navigation. Extensive evaluation on HM3D and HSSD shows that VoroNav surpasses existing methods in both success rate and exploration efficiency (absolute improvement: +2.8% Success and +3.7% SPL on HM3D, +2.6% Success and +3.8% SPL on HSSD). Newly introduced metrics that evaluate obstacle-avoidance proficiency and perceptual efficiency further corroborate the improvements achieved by our method in ZSON planning. Project page: https://voro-nav.github.io
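
The abstract reports gains in Success and SPL. SPL (Success weighted by Path Length) is a standard embodied-navigation metric: each episode contributes its success flag multiplied by the ratio of the shortest-path length to the longer of the agent's path and the shortest path, and the result is averaged over episodes. The sketch below computes it from per-episode lists; the function and argument names are illustrative and are not taken from the VoroNav code.

```python
def spl(successes, shortest_lengths, agent_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes        -- 1 (or True) if the episode succeeded, else 0
    shortest_lengths -- shortest-path length from start to goal per episode
    agent_lengths    -- length of the path the agent actually travelled
    """
    if not (len(successes) == len(shortest_lengths) == len(agent_lengths)):
        raise ValueError("all inputs must have the same number of episodes")
    if not successes:
        raise ValueError("at least one episode is required")

    total = 0.0
    for success, shortest, travelled in zip(successes, shortest_lengths, agent_lengths):
        denom = max(travelled, shortest)
        if success and denom > 0:
            total += shortest / denom  # 1.0 only when the agent's path is optimal
    return total / len(successes)


# Three episodes: optimal success, failure, success with a path twice as long.
print(spl([1, 0, 1], [5.0, 4.0, 10.0], [5.0, 8.0, 20.0]))  # 0.5
```

Because failed episodes contribute zero and detours shrink the ratio, SPL can never exceed the success rate, which is why the two numbers are usually reported together.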

Paper Structure

This paper contains 27 sections, 7 equations, 13 figures, 4 tables, and 2 algorithms.

Figures (13)

  • Figure 1: Voronoi-based Navigation with LLM. Our model focuses on optimizing the decision-making process in ZSON. Voronoi sparsification lets the agent pinpoint observation-rich intersections on the map, which act as navigation waypoints. At each intersection, the agent perceives the environment, collects scene information from nearby waypoints, and performs LLM-guided reasoning to identify the waypoint most likely to lead to the target. The five images in (a) show the agent's views as it faces the five adjacent navigation waypoints at the intersection illustrated in (b); the indices indicate the correspondence.
  • Figure 2: Components of VoroNav. VoroNav includes three modules. The perceptual inputs are RGB-D images and the real-time pose, and the agent's output is "Action". The RGB-D and pose observations are processed by the Semantic Mapping Module (light blue module) to form a semantic map. The Global Decision Module (light yellow module) generates the RVG, which is used to produce textual descriptions of the surrounding neighbor nodes and exploratory paths. This module then employs an LLM, which reasons over the fused prompt of scene descriptions, to help select the most promising neighbor node as a mid-term goal. The Local Policy Module (light green module) plans the agent's low-level actions to reach the target point.
  • Figure 3: Commonsense Reasoning with LLM. (a) The LLM analyzes the objects that appear along the path, together with their coordinates, and describes the scene along the path. (b) The LLM predicts the probability of the target object appearing in each area by interpreting the fused text descriptions of the scene.
  • Figure 4: Farsight Image Captioning. The agent selects all RGB images that capture the views of neighbor nodes and uses BLIP to generate captions of these images.
  • Figure 5: Simulation Experiments. Guided by the LLM, the agent explores efficiently, discovers the target at minimal path cost, and successfully navigates to it. The figure visualizes the RGB images and semantic maps for four global decision instances; the dialog box on the left shows the conversation between the agent and the LLM during the first global decision process.
  • ...and 8 more figures
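
The captions for Figures 2 and 3 describe the LLM estimating, for each neighbor node, how likely the target is to be found there, and the TL;DR mentions balancing exploration, efficiency, and commonsense cues. The exact scoring rule is not given in this summary, so the sketch below is only one plausible way to combine such terms: a weighted sum of an LLM probability, an exploration bonus, and a path-length penalty. The field names (`p_target`, `unexplored`, `path_len`), the weights, and the function `pick_waypoint` are all hypothetical.

```python
def pick_waypoint(candidates, w_llm=1.0, w_explore=0.5, w_cost=0.1):
    """Return the id of the best-scoring candidate waypoint.

    Each candidate is a dict with hypothetical fields:
      node       -- identifier of the neighbor node
      p_target   -- LLM-estimated probability that the target is near this node
      unexplored -- fraction of unexplored area reachable via this node (0..1)
      path_len   -- path length from the agent to the node, in metres
    """
    if not candidates:
        raise ValueError("no candidate waypoints to choose from")

    def score(cand):
        # Reward commonsense likelihood and exploration, penalise travel cost.
        return (w_llm * cand["p_target"]
                + w_explore * cand["unexplored"]
                - w_cost * cand["path_len"])

    return max(candidates, key=score)["node"]


neighbors = [
    {"node": "A", "p_target": 0.6, "unexplored": 0.2, "path_len": 2.0},
    {"node": "B", "p_target": 0.3, "unexplored": 0.8, "path_len": 1.0},
    {"node": "C", "p_target": 0.1, "unexplored": 0.1, "path_len": 0.5},
]
print(pick_waypoint(neighbors))  # B
```

In this toy example node A has the highest LLM probability, yet node B wins because it is closer and opens up more unexplored space, which shows how a combined score can differ from following the language model alone.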