Open-Nav: Exploring Zero-Shot Vision-and-Language Navigation in Continuous Environment with Open-Source LLMs

Yanyuan Qiao, Wenqi Lyu, Hui Wang, Zixu Wang, Zerui Li, Yuan Zhang, Mingkui Tan, Qi Wu

TL;DR

Open-Nav tackles zero-shot Vision-and-Language Navigation in continuous environments (VLN-CE) using locally deployed open-source LLMs, so that inference stays on the user's own hardware. It introduces a spatial-temporal chain-of-thought framework that decomposes navigation into instruction comprehension, progress estimation, and decision making, coupled with enhanced scene perception using SpatialBot and RAM, together with a waypoint predictor. The method shows competitive performance relative to GPT-4-based navigators in both simulated and real-world VLN-CE settings, while preserving privacy and lowering operational cost. This work suggests that open-source LLMs can effectively guide embodied navigation when paired with structured perception and reasoning modules, enabling scalable real-world deployment.

Abstract

Vision-and-Language Navigation (VLN) tasks require an agent to follow textual instructions to navigate through 3D environments. Traditional approaches use supervised learning methods, relying heavily on domain-specific datasets to train VLN models. Recent methods try to utilize closed-source large language models (LLMs) like GPT-4 to solve VLN tasks in a zero-shot manner, but face challenges related to expensive token costs and potential data breaches in real-world applications. In this work, we introduce Open-Nav, a novel study that explores open-source LLMs for zero-shot VLN in continuous environments. Open-Nav employs a spatial-temporal chain-of-thought (CoT) reasoning approach to break down tasks into instruction comprehension, progress estimation, and decision-making. It enhances scene perception with fine-grained object and spatial knowledge to improve the LLM's reasoning in navigation. Our extensive experiments in both simulated and real-world environments demonstrate that Open-Nav achieves competitive performance compared to using closed-source LLMs.
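
The three-stage decomposition described above (instruction comprehension, progress estimation, decision-making) can be sketched as a chain of prompts. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the `navigate_step` function, and the `llm` callable (any function that maps a prompt string to a reply string, for example a wrapper around a locally deployed model) are all hypothetical.

```python
import re
from typing import Callable, List


def navigate_step(
    llm: Callable[[str], str],
    instruction: str,
    history: List[str],
    candidates: List[str],
) -> int:
    """Run one hypothetical comprehension -> progress -> decision chain.

    Returns the index of the chosen candidate waypoint.
    """
    # Stage 1 (assumed prompt): instruction comprehension.
    plan = llm(
        "Break this navigation instruction into ordered actions and landmarks:\n"
        + instruction
    )

    # Stage 2 (assumed prompt): progress estimation against the step history.
    progress = llm(
        "Planned actions:\n" + plan
        + "\nSteps taken so far:\n" + ("\n".join(history) or "(none)")
        + "\nWhich planned actions are already completed?"
    )

    # Stage 3 (assumed prompt): decision over textual waypoint descriptions.
    options = "\n".join(f"{i}: {c}" for i, c in enumerate(candidates))
    reply = llm(
        "Progress:\n" + progress
        + "\nCandidate waypoints:\n" + options
        + "\nAnswer with the index of the next waypoint."
    )

    # Take the first integer in the reply; fall back to index 0 when the
    # reply has no integer or the integer is out of range.
    digits = re.findall(r"\d+", reply)
    choice = int(digits[0]) if digits else 0
    return choice if 0 <= choice < len(candidates) else 0
```

Passing the model as a plain callable keeps the sketch independent of any particular inference library, so the same chain could be driven by a local open-source model or by a stub during testing.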

Paper Structure (20 sections, 5 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Comparison between GPT-based Navigator and open-source LLM-based Navigator. The GPT-based Navigator requires continuous queries to the GPT model via API for navigation, incurring high costs and necessitating the transmission of environmental data to servers, which raises privacy concerns. In contrast, the open-source LLM-based Navigator utilizes locally deployed LLMs, which are not only free but also safeguard user privacy by eliminating the need to transmit sensitive data.
  • Figure 2: Overview of Open-Nav. The Waypoint Prediction module uses panoramic RGB and depth images to pinpoint potential navigation waypoints, which are then analyzed by the Scene Perception module. This module processes these images to determine object locations and spatial relationships and recognize scene elements. This data is then converted into a text format. The LLM Navigator performs tasks in three stages: understanding instructions, estimating progress, and making decisions. Finally, the navigator determines the precise location to navigate to and performs the corresponding actions.
  • Figure 3: The arrangement of the real-world environment.
  • Figure 4: Visualization of our method Open-Nav in a real environment. The right side of the picture shows the LLM's thoughts during navigation.
  • Figure 5: Performance of open-source LLMs on action decomposition.
  • ...and 1 more figure
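
The Figure 2 caption says that perception results for each candidate waypoint are converted into a text format before reaching the LLM navigator. A minimal sketch of such a conversion follows. The `Waypoint` fields, the heading convention, the direction thresholds, and the sentence template are all assumptions made for illustration; the caption does not specify the actual format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Waypoint:
    # Hypothetical record for one candidate waypoint.
    heading_deg: float  # assumed convention: 0 = straight ahead, positive = right
    distance_m: float   # distance from the agent in metres
    objects: List[str] = field(default_factory=list)  # recognized object tags
    spatial_note: str = ""  # free-text spatial relation, if any


def direction_label(heading_deg: float) -> str:
    """Map a relative heading to a coarse direction word (assumed thresholds)."""
    h = (heading_deg + 180.0) % 360.0 - 180.0  # normalize to [-180, 180)
    if abs(h) <= 30:
        return "front"
    if abs(h) >= 150:
        return "behind"
    return "right" if h > 0 else "left"


def describe_waypoints(waypoints: List[Waypoint]) -> str:
    """Render candidate waypoints as one line of text each."""
    lines = []
    for i, wp in enumerate(waypoints):
        objs = ", ".join(wp.objects) if wp.objects else "no recognized objects"
        line = (
            f"Waypoint {i}: {direction_label(wp.heading_deg)}, "
            f"{wp.distance_m:.1f} m away; sees {objs}."
        )
        if wp.spatial_note:
            line += " " + wp.spatial_note
        lines.append(line)
    return "\n".join(lines)
```

For example, `describe_waypoints([Waypoint(0, 1.5, ["sofa", "lamp"])])` yields the single line `Waypoint 0: front, 1.5 m away; sees sofa, lamp.`, which could then be placed in the navigator's prompt.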