FSR-VLN: Fast and Slow Reasoning for Vision-Language Navigation with Hierarchical Multi-modal Scene Graph

Xiaolin Zhou, Tingyang Xiao, Liu Liu, Yucheng Wang, Maiyue Chen, Xinrui Meng, Xinjie Wang, Wei Feng, Wei Sui, Zhizhong Su

TL;DR

FSR-VLN tackles long-range Vision-Language Navigation by introducing a Hierarchical Multi-modal Scene Graph (HMSG) and a Fast-to-Slow Reasoning (FSR) pipeline. The HMSG provides a four-level, open-vocabulary environment representation (floor, room, view, object) with multi-modal features, while fast CLIP-based grounding is refined by VLM verification via GPT-4o. The approach achieves state-of-the-art retrieval success and, by performing slow reasoning only when necessary, reduces response time by 82% compared to VLM-based methods on tour videos. Integrated with speech, planning, and control modules on a Unitree-G1 humanoid, it enables natural-language-guided navigation in the real world.

Abstract

Vision-Language Navigation (VLN) is a fundamental challenge in robotic systems, with broad applications for the deployment of embodied agents in real-world environments. Despite recent advances, existing approaches are limited in long-range spatial reasoning and often exhibit low success rates and high inference latency on long-range navigation tasks. To address these limitations, we propose FSR-VLN, a vision-language navigation system that combines a Hierarchical Multi-modal Scene Graph (HMSG) with Fast-to-Slow Navigation Reasoning (FSR). The HMSG provides a multi-modal map representation supporting progressive retrieval, from coarse room-level localization to fine-grained goal view and object identification. Building on HMSG, FSR first performs fast matching to efficiently select candidate rooms, views, and objects, then applies VLM-driven refinement for final goal selection. We evaluate FSR-VLN on four comprehensive indoor datasets collected by humanoid robots, using 87 instructions that cover a diverse range of object categories. FSR-VLN achieves state-of-the-art (SOTA) performance on all four datasets, measured by retrieval success rate (RSR), while reducing response time by 82% compared to VLM-based methods on tour videos by activating slow reasoning only when fast intuition fails. Furthermore, we integrate FSR-VLN with speech interaction, planning, and control modules on a Unitree-G1 humanoid robot, enabling natural language interaction and real-time navigation.
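
The fast-then-slow control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that precomputed toy vectors stand in for CLIP features, that `slow_verifier` is a hypothetical placeholder for a VLM call, and that a score margin between the top two candidates decides when the fast result is trusted. The abstract says only that slow reasoning is activated when fast intuition fails, so the margin gate is an assumption made for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fast_match(
    query: Vector, candidates: Dict[str, Vector], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Fast stage: rank candidates by similarity and keep the top_k."""
    scored = [(name, cosine(query, emb)) for name, emb in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def select_goal(
    query: Vector,
    candidates: Dict[str, Vector],
    slow_verifier: Callable[[List[str]], str],  # hypothetical stand-in for a VLM
    margin: float = 0.05,  # assumed gate, not taken from the paper
    top_k: int = 3,
) -> Tuple[str, bool]:
    """Return (goal_name, used_slow_reasoning)."""
    shortlist = fast_match(query, candidates, top_k)
    if not shortlist:
        raise ValueError("no candidates to match against")
    # Trust the fast result when it clearly beats the runner-up.
    if len(shortlist) == 1 or shortlist[0][1] - shortlist[1][1] >= margin:
        return shortlist[0][0], False
    # Otherwise defer to the slower verifier over the shortlist only.
    return slow_verifier([name for name, _ in shortlist]), True
```

Because the verifier sees only the shortlist, the expensive call is skipped entirely for unambiguous queries and is bounded by `top_k` candidates for ambiguous ones.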

Paper Structure

This paper contains 13 sections, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: System Overview. The proposed humanoid robot navigation system integrates HMSG with FSR to achieve view/object-level, long-range navigation in the real world. Specifically, RGBD and pose data are first used to construct the HMSG, which provides a hierarchical, multi-modal feature-based representation of the environment. During online interaction, the user's text or voice input is converted into instructions via voice activity detection and speech recognition, and the LLM infers the target object. Based on the HMSG, fast matching and slow VLM reasoning jointly identify the optimal goal view/object. The identified goals are subsequently used for global path planning.
  • Figure 2: HMSG representation. Our proposed HMSG is a four-level hierarchy: floor, room, view, and object nodes. Each node contains multi-modal features, including geometric attributes, semantic attributes, and topological relationships.
  • Figure 3: The navigation reasoning follows a coarse-to-fine process: 1) an LLM interprets user instructions into structured object/room queries; 2) CLIP-based fast matching, acting as intuition, retrieves candidate goal rooms, views, and objects; 3) VLM-based slow reasoning refines the candidate results to ensure accurate goal view and object selection.
  • Figure 4: The goal view and object retrieval results of FSR-VLN for four different instructions (Reasoning-Free, Reasoning-Required, Small Object, and Spatial Target) in Room4 (40 m × 20 m).
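
The four-level hierarchy named in the Figure 2 caption (floor, room, view, object) can be pictured as a simple tree of typed nodes. The sketch below is a hypothetical illustration only: the class name `Node`, its field names, and the example attribute values are assumptions, not the paper's data format. It shows how each node can carry geometric and semantic attributes while parent/child links encode the topology.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Level order taken from the Figure 2 caption.
LEVELS = ("floor", "room", "view", "object")


@dataclass(eq=False)  # identity-based equality avoids recursing through parent links
class Node:
    name: str
    level: str
    geometry: Dict[str, float] = field(default_factory=dict)   # e.g. position (assumed keys)
    semantics: Dict[str, object] = field(default_factory=dict)  # e.g. label, embedding
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add_child(self, child: "Node") -> "Node":
        """Attach a child, enforcing that it sits exactly one level below."""
        idx = LEVELS.index(self.level)
        if idx + 1 >= len(LEVELS) or child.level != LEVELS[idx + 1]:
            raise ValueError(f"{child.level!r} cannot be a child of {self.level!r}")
        child.parent = self
        self.children.append(child)
        return child

    def path(self) -> Tuple[str, ...]:
        """Names from the root floor down to this node."""
        names = []
        node: Optional["Node"] = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return tuple(reversed(names))


# Illustrative usage with made-up names and coordinates.
floor = Node("floor_1", "floor")
room = floor.add_child(Node("kitchen", "room"))
view = room.add_child(Node("view_012", "view", geometry={"x": 3.2, "y": 1.5}))
mug = view.add_child(Node("mug", "object", semantics={"label": "mug"}))
print(mug.path())
```

Enforcing the level order in `add_child` keeps retrieval progressive: a search can narrow from floor to room to view before any object is considered.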