Fast-SmartWay: Panoramic-Free End-to-End Zero-Shot Vision-and-Language Navigation

Xiangyu Shi, Zerui Li, Yanyuan Qiao, Qi Wu

TL;DR

This work tackles zero-shot Vision-and-Language Navigation in Continuous Environments (VLN-CE) under realistic constraints by removing panoramic sensing and waypoint predictors. It introduces Fast-SmartWay, an end-to-end framework that uses only three frontal RGB-D views and a multimodal large language model to directly predict navigation actions, complemented by Spatial-Semantic Textual Description Generation. To improve robustness, the approach adds an Uncertainty-Aware Reasoning module with a Disambiguation component and a Future-Past Bidirectional Reasoning (FPBR) mechanism, enabling dynamic reorientation and coherent long-horizon planning without retraining. Experiments in both simulated and real-world settings demonstrate faster per-step latency while achieving competitive or superior navigation performance compared to panoramic baselines, highlighting practical deployability for real robots. Overall, the method presents a scalable, end-to-end solution that leverages multimodal reasoning to balance efficiency and robustness in zero-shot embodied navigation.

Abstract

Recent advances in Vision-and-Language Navigation in Continuous Environments (VLN-CE) have leveraged multimodal large language models (MLLMs) to achieve zero-shot navigation. However, existing methods often rely on panoramic observations and two-stage pipelines involving waypoint predictors, which introduce significant latency and limit real-world applicability. In this work, we propose Fast-SmartWay, an end-to-end zero-shot VLN-CE framework that eliminates the need for panoramic views and waypoint predictors. Our approach uses only three frontal RGB-D images combined with natural language instructions, enabling MLLMs to directly predict actions. To enhance decision robustness, we introduce an Uncertainty-Aware Reasoning module that integrates (i) a Disambiguation Module for avoiding local optima, and (ii) a Future-Past Bidirectional Reasoning mechanism for globally coherent planning. Experiments on both simulated and real-robot environments demonstrate that our method significantly reduces per-step latency while achieving competitive or superior performance compared to panoramic-view baselines. These results demonstrate the practicality and effectiveness of Fast-SmartWay for real-world zero-shot embodied navigation.

Paper Structure

This paper contains 19 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Comparison between previous VLN-CE pipelines and our proposed end-to-end framework. Top: Previous two-stage methods first feed panoramic RGB-D observations (12 views) into a waypoint predictor to generate candidate waypoints. In the second stage, the candidate waypoints, together with the instruction, are provided to the navigator to select the final navigation action. Bottom: Our framework integrates the instruction with three frontal-view images in an Uncertainty-Aware Navigator, directly predicting navigation actions in a single end-to-end step.
  • Figure 2: Overall workflow of our proposed zero-shot VLN-CE framework. The navigation process begins with panoramic RGB-D observations at the initial or disambiguation stage, and uses three frontal RGB-D views during the step-wise stage. The system first constructs structured prompts, then extracts spatial and semantic descriptions, and finally sends both together to the Multimodal Large Language Model (MLLM). Within the Decision Process, the MLLM performs step-wise reasoning and incorporates Future-Past Bidirectional Reasoning (FPBR) to ensure globally consistent planning. It also determines whether the robot is confused based on the instruction and context. If so, the Disambiguation Module is triggered to collect a panoramic observation and replan. The robot used in the figure is a Hello Robot [HelloRobot2025] equipped with an Intel RealSense camera mounted at a height of 125 cm.
  • Figure 3: Overview of the structured Decision Process in the MLLM-based step-wise navigation pipeline. Upon receiving inputs, the model evaluates candidate views, predicts future observations, analyses the previous action, determines stop conditions, and estimates a safe moving distance in the selected direction.
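
The control flow described in the captions for Figures 2 and 3 can be illustrated with a minimal Python sketch. This is not the authors' implementation. Every name in it (`StepDecision`, its fields, `navigate`, and the callback parameters) is hypothetical, and the MLLM call is abstracted as a plain callable `decide`. The sketch only shows the ordering the captions describe: a panoramic observation at the initial stage, three frontal views during step-wise navigation, and a panoramic re-observation plus replanning when the model reports confusion.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class StepDecision:
    """Hypothetical container for one structured decision (fields loosely follow Figure 3)."""
    direction: str          # selected candidate view, e.g. "left", "front", "right"
    distance_m: float       # estimated safe moving distance in that direction
    stop: bool              # whether the stop condition is judged to be met
    confused: bool          # whether the model reports uncertainty about the step
    future_note: str = ""   # free-text prediction of future observations
    past_note: str = ""     # free-text analysis of the previous action


Views = Sequence[object]  # placeholder type for RGB-D views
Decide = Callable[[str, Views, List[StepDecision]], StepDecision]


def navigate(
    instruction: str,
    observe_frontal: Callable[[], Views],
    observe_panorama: Callable[[], Views],
    decide: Decide,
    execute: Callable[[StepDecision], None],
    max_steps: int = 20,
) -> List[StepDecision]:
    """Run a step-wise loop with a confusion-triggered panoramic replan (assumed logic)."""
    history: List[StepDecision] = []
    views = observe_panorama()  # initial stage uses a panoramic observation
    is_panoramic = True
    for _ in range(max_steps):
        decision = decide(instruction, views, history)
        if decision.confused and not is_panoramic:
            # Disambiguation: collect a panoramic observation and replan once.
            views = observe_panorama()
            decision = decide(instruction, views, history)
        history.append(decision)
        if decision.stop:
            break
        execute(decision)          # move in the selected direction
        views = observe_frontal()  # step-wise stage uses frontal views only
        is_panoramic = False
    return history
```

In this sketch the replan happens at most once per step, which is a design assumption made to keep the loop bounded; the source does not specify how repeated confusion is handled.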