FSR-VLN: Fast and Slow Reasoning for Vision-Language Navigation with Hierarchical Multi-modal Scene Graph

Xiaolin Zhou; Tingyang Xiao; Liu Liu; Yucheng Wang; Maiyue Chen; Xinrui Meng; Xinjie Wang; Wei Feng; Wei Sui; Zhizhong Su

TL;DR

FSR-VLN tackles long-range Vision-Language Navigation by introducing a Hierarchical Multi-modal Scene Graph (HMSG) and a Fast-to-Slow Reasoning (FSR) pipeline. The HMSG provides a four-level, open-vocabulary environment representation (floor, room, view, object) with multi-modal features, while fast CLIP-based grounding is refined by VLM verification via GPT-4o. The approach achieves a state-of-the-art retrieval success rate and significantly reduces latency by performing slow reasoning only when necessary. Integration with speech, planning, and control modules on a Unitree-G1 humanoid enables natural-language-guided navigation in the real world.

Abstract

Vision-Language Navigation (VLN) is a fundamental challenge in robotic systems, with broad applications for the deployment of embodied agents in real-world environments. Despite recent advances, existing approaches are limited in spatial reasoning, often exhibiting low success rates and high inference latency, particularly in long-range navigation tasks. To address these limitations, we propose FSR-VLN, a vision-language navigation system that combines a Hierarchical Multi-modal Scene Graph (HMSG) with Fast-to-Slow Navigation Reasoning (FSR). The HMSG provides a multi-modal map representation that supports progressive retrieval, from coarse room-level localization to fine-grained goal view and object identification. Building on the HMSG, FSR first performs fast matching to efficiently select candidate rooms, views, and objects, then applies VLM-driven refinement for final goal selection. We evaluate FSR-VLN on four comprehensive indoor datasets collected by humanoid robots, using 87 instructions that cover a diverse range of object categories. FSR-VLN achieves state-of-the-art (SOTA) performance on all datasets, measured by the retrieval success rate (RSR), while reducing response time by 82% compared to VLM-based methods on tour videos, because slow reasoning is activated only when fast intuition fails. Furthermore, we integrate FSR-VLN with speech interaction, planning, and control modules on a Unitree-G1 humanoid robot, enabling natural language interaction and real-time navigation.
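The coarse-to-fine retrieval described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Node` class, the `fast_match` and `retrieve_goal` functions, the two-dimensional feature vectors, the `accept` similarity threshold, and the `slow_verify` callback (a stand-in for a VLM call) are all hypothetical names and choices introduced here for illustration. In particular, the abstract does not say how the system decides that fast matching has failed; a fixed cosine-similarity threshold is assumed below.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Callable, Optional, Sequence


@dataclass
class Node:
    """One node of a toy hierarchical scene graph (floor, room, view, or object)."""
    name: str
    level: str
    feature: Sequence[float]  # stand-in for a multi-modal embedding
    children: list["Node"] = field(default_factory=list)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fast_match(query: Sequence[float], nodes: list[Node], top_k: int) -> list[Node]:
    """Fast step: rank nodes by embedding similarity and keep the top_k."""
    ranked = sorted(nodes, key=lambda n: cosine(query, n.feature), reverse=True)
    return ranked[:top_k]


def retrieve_goal(
    query: Sequence[float],
    floors: list[Node],
    slow_verify: Callable[[list[Node]], Optional[Node]],
    top_k: int = 2,
    accept: float = 0.9,  # assumed confidence threshold, not from the paper
) -> Optional[Node]:
    """Coarse-to-fine retrieval: candidate rooms, then views, then objects."""
    rooms = [r for f in floors for r in f.children]
    cand_rooms = fast_match(query, rooms, top_k)
    views = [v for r in cand_rooms for v in r.children]
    cand_views = fast_match(query, views, top_k)
    objects = [o for v in cand_views for o in v.children]
    cand_objects = fast_match(query, objects, top_k)
    if not cand_objects:
        return None
    best = cand_objects[0]
    if cosine(query, best.feature) >= accept:
        return best  # fast match is confident, so the slow step is skipped
    chosen = slow_verify(cand_objects)  # slow step, e.g. a VLM query
    return chosen if chosen is not None else best


# Tiny hand-built graph: one floor, two rooms, one view per room.
mug = Node("mug", "object", [1.0, 0.0])
kettle = Node("kettle", "object", [0.8, 0.6])
sofa = Node("sofa", "object", [0.0, 1.0])
kitchen_view = Node("kitchen_view_0", "view", [0.9, 0.1], [mug, kettle])
lounge_view = Node("lounge_view_0", "view", [0.1, 0.9], [sofa])
kitchen = Node("kitchen", "room", [0.9, 0.2], [kitchen_view])
lounge = Node("lounge", "room", [0.2, 0.9], [lounge_view])
floor_1 = Node("floor_1", "floor", [0.5, 0.5], [kitchen, lounge])

slow_calls: list[list[str]] = []


def pick_last(candidates: list[Node]) -> Optional[Node]:
    """Dummy verifier that records its input and returns the last candidate."""
    slow_calls.append([c.name for c in candidates])
    return candidates[-1]


confident = retrieve_goal([1.0, 0.0], [floor_1], pick_last)
uncertain = retrieve_goal([0.6, 0.8], [floor_1], pick_last, accept=0.99)
```

In this sketch the first query matches the mug's feature exactly, so the verifier is never called. The second query falls below the stricter threshold, so the top candidates are handed to the verifier, which makes the final choice. Keeping the verifier as a callback means the expensive call is paid only on the uncertain path, which is the idea behind the latency reduction reported in the abstract.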
