OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models

Yuxuan Kuang, Hai Lin, Meng Jiang

TL;DR

OpenFMNav tackles open-set, free-form instruction-driven object navigation by fusing language and vision foundation models. It uses ProposeLLM to hypothesize object candidates, DiscoverVLM to opportunistically find scene objects, and PerceptVLM to detect and segment them, all feeding a live Versatile Semantic Score Map (VSSM). A ReasonLLM-based reasoning loop selects frontier goals on the VSSM, guiding a Fast Marching Method planner to generate low-level actions, enabling zero-shot navigation in unseen environments. Experiments on HM3D ObjectNav show state-of-the-art performance and zero-shot generalization, with real-robot demonstrations validating real-world applicability and open-set reasoning capabilities.

Abstract

Object navigation (ObjectNav) requires an agent to navigate through unseen environments to find queried objects. Many previous methods attempted to solve this task with supervised or reinforcement learning, training on limited household datasets with closed-set objects. However, two key challenges remain unsolved: understanding free-form natural language instructions that demand open-set objects, and generalizing to new environments in a zero-shot manner. To address these two challenges, we propose OpenFMNav, an Open-set Foundation Model based framework for zero-shot object Navigation. We first unleash the reasoning abilities of large language models (LLMs) to extract proposed objects from natural language instructions that meet the user's demand. We then leverage the generalizability of large vision-language models (VLMs) to actively discover and detect candidate objects from the scene, building a Versatile Semantic Score Map (VSSM). Then, by conducting common-sense reasoning on the VSSM, our method can perform effective language-guided exploration and exploitation of the scene and finally reach the goal. By leveraging the reasoning and generalizing abilities of foundation models, our method can understand free-form human instructions and perform effective open-set zero-shot navigation in diverse environments. Extensive experiments on the HM3D ObjectNav benchmark show that our method surpasses all the strong baselines on all metrics, demonstrating its effectiveness. Furthermore, we perform real robot demonstrations to validate our method's open-set-ness and generalizability to real-world environments.
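
The frontier-based exploration on a semantic score map described above can be illustrated with a small sketch. This is not the authors' implementation: the grid encoding, the function names (`find_frontiers`, `score_frontier`, `select_goal`), and the window-sum scoring are all assumptions made for illustration. In particular, the numeric window sum is only a stand-in for the LLM-based common-sense scoring the paper describes, and path planning (the Fast Marching Method step) is omitted.

```python
from typing import List, Optional, Tuple

# Hypothetical cell encoding for an occupancy grid (an assumption, not from the paper).
FREE, UNKNOWN, OBSTACLE = 0, -1, 1
Cell = Tuple[int, int]


def find_frontiers(grid: List[List[int]]) -> List[Cell]:
    """Return free cells that are 4-adjacent to at least one unknown cell."""
    rows, cols = len(grid), len(grid[0])
    frontiers: List[Cell] = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break
    return frontiers


def score_frontier(cell: Cell, score_map: List[List[float]], radius: int = 1) -> float:
    """Sum semantic scores in a square window around the cell.

    Placeholder for the reasoning-based scoring; the real method queries an LLM.
    """
    rows, cols = len(score_map), len(score_map[0])
    r0, c0 = cell
    total = 0.0
    for r in range(max(0, r0 - radius), min(rows, r0 + radius + 1)):
        for c in range(max(0, c0 - radius), min(cols, c0 + radius + 1)):
            total += score_map[r][c]
    return total


def select_goal(grid: List[List[int]], score_map: List[List[float]]) -> Optional[Cell]:
    """Pick the frontier with the highest score, or None if the map is fully explored."""
    frontiers = find_frontiers(grid)
    if not frontiers:
        return None
    return max(frontiers, key=lambda cell: score_frontier(cell, score_map))


if __name__ == "__main__":
    grid = [
        [FREE, FREE, UNKNOWN],
        [FREE, OBSTACLE, UNKNOWN],
        [FREE, FREE, UNKNOWN],
    ]
    # Higher values mean a cell is more likely to be near a target-relevant object.
    score_map = [
        [0.0, 0.1, 0.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.2, 0.9],
    ]
    print(select_goal(grid, score_map))
```

In this toy map the two frontier cells are (0, 1) and (2, 1); the second is chosen because the scores around it are higher. A real system would pass the selected goal to a planner to produce low-level actions.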

Paper Structure

This paper contains 26 sections, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Leveraging foundation models, our proposed OpenFMNav can follow free-form natural language instructions with open-set objects and achieve effective zero-shot object navigation.
  • Figure 2: The framework of our proposed OpenFMNav. Based on the natural language instruction and observations, we utilize foundation models to interpret human instructions and construct a Versatile Semantic Score Map (VSSM), on which we perform common sense reasoning and scoring to conduct language-guided frontier-based exploration.
  • Figure 3: Types and percentages of failure cases in ablation methods.
  • Figure 4: Qualitative studies in the real world. Text marked in red indicates objects that potentially satisfy the instruction. Results show that our method is robust to natural language instructions, including distractors, open-set objects and free-form demands.
  • Figure 5: Initial prior objects $O_{pri}$
  • ...and 3 more figures