Zero-shot Object Navigation with Vision-Language Models Reasoning

Congcong Wen, Yisiyuan Huang, Hao Huang, Yanjia Huang, Shuaihang Yuan, Yu Hao, Hui Lin, Yu-Shen Liu, Yi Fang

TL;DR

VLTNet, a novel Vision Language model with a Tree-of-thought Network for L-ZSON, uses the ToT reasoning framework to select navigation frontiers during robot exploration, enabling globally informed decision-making with higher accuracy.

Abstract

Object navigation is crucial for robots, but traditional methods require substantial training data and do not generalize to unknown environments. Zero-shot object navigation (ZSON) aims to address this challenge, allowing robots to interact with unknown objects without specific training data. Language-driven zero-shot object navigation (L-ZSON) is an extension of ZSON that incorporates natural language instructions to guide robot navigation and interaction with objects. In this paper, we propose a novel Vision Language model with a Tree-of-thought Network (VLTNet) for L-ZSON. VLTNet comprises four main modules: vision language model understanding, semantic mapping, tree-of-thought reasoning and exploration, and goal identification. Among these modules, the Tree-of-Thought (ToT) reasoning and exploration module serves as the core component, using the ToT reasoning framework for navigation frontier selection during robot exploration. Compared to conventional frontier selection without reasoning, navigation using ToT reasoning involves multi-path reasoning processes and backtracking when necessary, enabling globally informed decision-making with higher accuracy. Experimental results on the PASTURE and RoboTHOR benchmarks demonstrate the outstanding performance of our model in L-ZSON, particularly in scenarios involving complex natural language as target instructions.

Paper Structure

This paper contains 24 sections, 4 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: Comparison of different object navigation methods under two types of language input: 1) word input with only the object category, 2) sentence input with detailed spatial descriptions. a) FBE model [Yamauchi]: cannot accept either word or sentence input. b) ESC model [Zhou_2023]: only accepts word input. c) Our model: accepts both word and sentence input.
  • Figure 2: Illustration of our VLTNet framework. During navigation, the Vision Language Model (VLM) Understanding module obtains the observed objects by parsing the current RGB observations of an agent. Based on the object locations provided by the VLM Understanding module and the depth observations from the agent, the Semantic Mapping module reconstructs a semantic navigation map containing rooms, objects, and frontiers. Conditioned on the navigation instruction and the semantic navigation map, the agent then performs common sense reasoning via the Tree of Thoughts Reasoning and Exploration module to infer the most probable location of the goal object and select the corresponding frontier to explore. Once the VLM Understanding module grounds a candidate object in the same category as the goal object, the Goal Identification module further verifies whether the candidate object reached by the agent matches the description in the navigation instruction.
  • Figure 3: Visualization of the egocentric trajectories of VLTNet and ESC when given a spatial goal instruction. Color indicates trajectory progress, with blue marking the trajectory start and white marking the trajectory end. The goal objects are boxed in green, while distractors are boxed in red.
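
The frontier-selection step described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: the names `Frontier`, `ROOM_PRIOR`, `OBJECT_PRIOR`, `score_frontier`, `rank_frontiers`, and `explore_with_fallback` are all hypothetical, the hand-written prior tables stand in for the common sense reasoning that the paper obtains from a language model, and "backtracking" is reduced here to falling back to the next-best frontier when the goal is not found.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Frontier:
    """A candidate exploration boundary on the semantic map (hypothetical structure)."""
    name: str
    room: str
    nearby_objects: list[str] = field(default_factory=list)

# Assumed toy priors; a real system would query a language model instead.
ROOM_PRIOR = {("mug", "kitchen"): 0.8, ("mug", "living room"): 0.4, ("mug", "bedroom"): 0.2}
OBJECT_PRIOR = {("mug", "sink"): 0.7, ("mug", "coffee table"): 0.5, ("mug", "bed"): 0.1}

def score_frontier(goal: str, frontier: Frontier) -> float:
    """Combine room-level and object-level evidence that the goal lies beyond this frontier."""
    room_score = ROOM_PRIOR.get((goal, frontier.room), 0.1)
    object_score = max(
        (OBJECT_PRIOR.get((goal, obj), 0.0) for obj in frontier.nearby_objects),
        default=0.0,
    )
    return 0.5 * room_score + 0.5 * object_score

def rank_frontiers(goal: str, frontiers: list[Frontier]) -> list[Frontier]:
    """Order frontiers from most to least promising."""
    return sorted(frontiers, key=lambda f: score_frontier(goal, f), reverse=True)

def explore_with_fallback(
    goal: str,
    frontiers: list[Frontier],
    goal_found_at: Callable[[Frontier], bool],
) -> list[str]:
    """Visit frontiers in ranked order, falling back to the next candidate on failure."""
    visited = []
    for frontier in rank_frontiers(goal, frontiers):
        visited.append(frontier.name)
        if goal_found_at(frontier):  # stands in for the goal identification check
            break
    return visited

frontiers = [
    Frontier("f_bedroom", "bedroom", ["bed"]),
    Frontier("f_living", "living room", ["coffee table"]),
    Frontier("f_kitchen", "kitchen", ["sink"]),
]
path = explore_with_fallback("mug", frontiers, lambda f: f.room == "living room")
print(path)
```

In this toy run the kitchen frontier is tried first because it has the highest combined score; when the goal is not found there, the agent falls back to the living-room frontier. The equal 0.5 weights are an arbitrary choice made for the example.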