SFCo-Nav: Efficient Zero-Shot Visual Language Navigation via Collaboration of Slow LLM and Fast Attributed Graph Alignment

Chaoran Xiong, Litao Wei, Xinhao Hu, Kehui Ma, Ziyi Xia, Zixin Jiang, Zhen Sun, Ling Pei

TL;DR

SFCo-Nav is the first slow-fast collaborative zero-shot VLN system that triggers the LLM asynchronously based on an internal confidence estimate. It matches or exceeds prior state-of-the-art zero-shot VLN success rates while cutting total token consumption per trajectory by over 50% and running more than 3.5 times faster.

Abstract

Recent advances in large vision-language models (VLMs) and large language models (LLMs) have enabled zero-shot approaches to visual language navigation (VLN), where an agent follows natural language instructions using only egocentric perception and reasoning. However, existing zero-shot methods typically construct a naive observation graph and perform per-step VLM-LLM inference on it, resulting in high latency and computation costs that limit real-time deployment. To address this, we present SFCo-Nav, an efficient zero-shot VLN framework inspired by the principle of slow-fast cognitive collaboration. SFCo-Nav integrates three key modules: 1) a slow LLM-based planner that produces a strategic chain of subgoals, each linked to an imagined object graph; 2) a fast reactive navigator for real-time object graph construction and subgoal execution; and 3) a lightweight asynchronous slow-fast bridge that aligns the structured, attributed imagined and perceived graphs to estimate navigation confidence, triggering the slow LLM planner only when necessary. To the best of our knowledge, SFCo-Nav is the first slow-fast collaborative zero-shot VLN system that supports asynchronous LLM triggering based on internal confidence. Evaluated on the public R2R and REVERIE benchmarks, SFCo-Nav matches or exceeds prior state-of-the-art zero-shot VLN success rates while cutting total token consumption per trajectory by over 50% and running more than 3.5 times faster. Finally, we demonstrate SFCo-Nav on a legged robot in a hotel suite, showcasing its efficiency and practicality in indoor environments.
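The division of labor among the three modules can be illustrated with a minimal control-loop sketch. This is not the authors' implementation: the names `Subgoal`, `RunStats`, `navigate`, and the toy stubs are hypothetical, graphs are reduced to a mapping from object label to attribute set, and the loop runs synchronously for simplicity, whereas the paper describes an asynchronous bridge. It only shows the idea that the slow planner is called when confidence is low rather than at every step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

# Hypothetical graph type for illustration: object label -> attribute set.
Graph = Dict[str, Set[str]]


@dataclass
class Subgoal:
    description: str
    imagined_graph: Graph  # objects the planner expects to see for this subgoal


@dataclass
class RunStats:
    steps: int = 0
    slow_calls: int = 0


def navigate(
    instruction: str,
    slow_planner: Callable[[str, Graph], List[Subgoal]],
    perceive: Callable[[int], Graph],
    confidence: Callable[[Graph, Graph], float],
    tau_c: float,
    max_steps: int,
) -> RunStats:
    """Run a confidence-gated loop and count how often the slow planner is called."""
    stats = RunStats()
    perceived: Graph = {}
    plan = slow_planner(instruction, perceived)  # initial slow planning call
    stats.slow_calls += 1
    for t in range(max_steps):
        perceived = perceive(t)  # fast path: refresh the perceived object graph
        stats.steps += 1
        if not plan:
            break
        c = confidence(plan[0].imagined_graph, perceived)
        if c <= tau_c:
            # Low confidence: hand control back to the slow planner.
            plan = slow_planner(instruction, perceived)
            stats.slow_calls += 1
        # Otherwise the fast navigator keeps executing the current subgoal
        # (low-level action selection is omitted in this sketch).
    return stats


# Toy stubs (all assumptions) so the loop can be exercised end to end.
def toy_planner(instruction: str, perceived: Graph) -> List[Subgoal]:
    return [Subgoal("walk to the sofa", {"sofa": set()})]


def toy_perceive(t: int) -> Graph:
    # The sofa only becomes visible from step 2 onward.
    return {} if t < 2 else {"sofa": set()}


def toy_confidence(imagined: Graph, perceived: Graph) -> float:
    # Fraction of imagined object labels that are currently perceived.
    if not imagined:
        return 0.0
    return sum(1 for label in imagined if label in perceived) / len(imagined)


stats = navigate("Go to the sofa.", toy_planner, toy_perceive,
                 toy_confidence, tau_c=0.5, max_steps=5)
print(stats.steps, stats.slow_calls)  # 5 3
```

In this toy run the planner is called three times over five steps (once at the start and twice while the sofa is out of view), whereas a per-step scheme would call it at every step. The numbers are illustrative only and say nothing about the paper's reported savings.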

Paper Structure (28 sections, 9 equations, 5 figures, 4 tables, 2 algorithms)


Figures (5)

  • Figure 1: SFCo-Nav is an efficient zero-shot VLN framework inspired by slow–fast cognitive collaboration. It comprises three modules: 1) a slow brain LLM-based planner; 2) a fast brain reactive navigator; and 3) a lightweight asynchronous slow–fast bridge that aligns the imagined and perceived graphs, estimates navigation confidence, and triggers the LLM only when necessary. This design minimizes costly LLM calls while preserving high navigation success.
  • Figure 2: System overview of SFCo-Nav, a slow–fast collaborative framework for efficient zero-shot visual language navigation. The Slow LLM Planner ($\Pi_{\mathrm{slow}}$) decomposes the navigation instruction into subgoals, each paired with an imagined object graph $G^{i}_t$. The Fast Reactive Navigator ($\pi_{\mathrm{fast}}$) builds a perceived object graph $G^{p}_t$ in real time and executes low-level actions to align with $G^{i}_t$. The Slow–Fast Bridge computes the graph-alignment confidence $C_t$; high confidence ($C_t > \tau_C$) keeps control with $\pi_{\mathrm{fast}}$, while low confidence ($C_t \leq \tau_C$) triggers replanning by $\Pi_{\mathrm{slow}}$.
  • Figure 3: Slow LLM Planner prompt structure and operation process.
  • Figure 4: Observed and imagined attributed graph structure.
  • Figure 5: Real-world hotel suite deployment of SFCo-Nav. Early in navigation, sparse observations yield low match probability, triggering the slow planner. As observations grow, confidence exceeds the threshold, enabling fast, LLM-free execution. This slow-fast collaboration preserves success while improving time efficiency and reducing token usage.
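The behavior described in the Figure 2 and Figure 5 captions (low confidence under sparse observations, then confidence rising above $\tau_C$ as more objects are seen) can be mimicked with a toy alignment score. The scoring rule below is an assumption made purely for illustration and is not the paper's definition of $C_t$: it ignores graph edges and simply averages, over imagined objects, the fraction of imagined attributes found on a perceived object with the same label. The function names, the example graphs, and the threshold value 0.5 are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical graph type for illustration: object label -> attribute set.
Graph = Dict[str, Set[str]]


def node_score(imagined_attrs: Set[str], perceived_attrs: Set[str]) -> float:
    """Fraction of imagined attributes that were observed (1.0 if none were imagined)."""
    if not imagined_attrs:
        return 1.0
    return len(imagined_attrs & perceived_attrs) / len(imagined_attrs)


def alignment_confidence(imagined: Graph, perceived: Graph) -> float:
    """Toy stand-in for C_t: mean per-object score over the imagined graph."""
    if not imagined:
        return 0.0
    total = 0.0
    for label, attrs in imagined.items():
        if label in perceived:
            total += node_score(attrs, perceived[label])
    return total / len(imagined)


def should_replan(confidence: float, tau_c: float) -> bool:
    """Gate from the Figure 2 caption: replan when C_t <= tau_C."""
    return confidence <= tau_c


# Imagined graph for one subgoal and a growing sequence of observations.
imagined: Graph = {"bed": {"white"}, "lamp": set(), "door": {"wooden"}}
observations: List[Graph] = [
    {"lamp": set()},
    {"lamp": set(), "bed": {"white", "large"}},
    {"lamp": set(), "bed": {"white", "large"}, "door": {"wooden"}},
]

tau_c = 0.5  # illustrative threshold, not a value from the paper
confidences = [alignment_confidence(imagined, g) for g in observations]
triggers = [should_replan(c, tau_c) for c in confidences]
print(triggers)  # [True, False, False]
```

With only the lamp in view, one of three imagined objects is matched, so the score is 1/3 and the slow planner would be triggered. Once the bed and then the door are observed, the score rises to 2/3 and 1, and control would stay with the fast navigator.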