VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning

Zhangyang Qi, Zhixiong Zhang, Yizhou Yu, Jiaqi Wang, Hengshuang Zhao

TL;DR

VLN-R1 tackles vision-language navigation in continuous environments by leveraging large vision-language models to directly map egocentric video to low-level actions. It introduces the VLN-Ego dataset and a Long-Short Memory Sampling strategy, and trains models in two stages: supervised fine-tuning and reinforcement fine-tuning with a Time-Decayed Reward guided by Group Relative Policy Optimization. Results on VLN-CE R2R and RxR show state-of-the-art performance, with small LVLMs achieving comparable or superior results to larger ones after reinforcement fine-tuning, and effective cross-domain transfer with limited RxR data. The work demonstrates that LVLMs can perform end-to-end embodied navigation with data-efficient post-training and opens avenues for more flexible, real-time navigation.

Abstract

Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. Current language model-based navigation systems operate on discrete topological graphs, limiting path planning to predefined node connections. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLMs) to directly translate egocentric video streams into continuous navigation actions, adopting GRPO-based training inspired by DeepSeek-R1. To enable effective training, we first construct the VLN-Ego dataset using a 3D simulator, Habitat, and propose Long-Short Memory Sampling to balance historical and current observations. While large language models can supervise complete textual instructions, they lack fine-grained action-level control. Our framework employs a two-stage training approach: a) Supervised fine-tuning (SFT) to align the model's action sequence text predictions with expert demonstrations, followed by b) Reinforcement fine-tuning (RFT) enhanced with a Time-Decayed Reward (TDR) mechanism that strategically weights multi-step future actions. Experimental results show that VLN-R1 achieves strong performance on the VLN-CE benchmark. VLN-R1 shows that LVLMs can drive embodied navigation and enhance task-specific reasoning through data-efficient, reward-driven post-training.
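
The reinforcement fine-tuning stage described above can be illustrated with a minimal sketch. The abstract does not give the exact form of the Time-Decayed Reward, so the geometric decay factor, the exact-match scoring, and the function names `time_decayed_reward` and `group_relative_advantages` below are assumptions made for illustration only. The group-relative normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
import math
from typing import List, Sequence


def time_decayed_reward(
    predicted: Sequence[str], expert: Sequence[str], decay: float = 0.8
) -> float:
    """Hypothetical time-decayed reward in [0, 1].

    Step t of the expert sequence carries weight decay**t, so nearer-term
    actions count more than later ones (an assumed weighting scheme).
    """
    if not expert:
        return 0.0
    weights = [decay ** t for t in range(len(expert))]
    matched = sum(
        w
        for t, w in enumerate(weights)
        if t < len(predicted) and predicted[t] == expert[t]
    )
    return matched / sum(weights)


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Usage: score a group of sampled action sequences against one expert sequence.
expert_actions = ["FORWARD", "FORWARD", "LEFT", "FORWARD"]
samples = [
    ["FORWARD", "FORWARD", "LEFT", "FORWARD"],   # fully correct
    ["FORWARD", "FORWARD", "RIGHT", "FORWARD"],  # late mistake
    ["LEFT", "FORWARD", "LEFT", "FORWARD"],      # early mistake
]
rewards = [time_decayed_reward(s, expert_actions) for s in samples]
advantages = group_relative_advantages(rewards)
```

Under this assumed weighting, an early mistake costs more reward than a late one, which is one plausible reading of a reward that decays over future steps. The action names are placeholders, not the dataset's actual vocabulary.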

Paper Structure

This paper contains 17 sections, 9 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Overview of VLN-R1. Previous LLM/LVLM models were based on discrete positions and used a third-person perspective for path planning. In contrast, VLN-R1 directly explores in a continuous environment using first-person perspective videos. We train the LVLM using Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT).
  • Figure 2: Data Engine: VLN-Ego. We created a dataset named VLN-Ego for LVLM-based navigation using Habitat's virtual simulation engine. Its textual annotations primarily consist of three parts: Instruction Part, Vision Part, and Action Part.
  • Figure 3: Model Architecture of VLN-R1. VLN-R1 employs a Long-Short Memory approach for processing visual inputs. The training consists of two stages. During the supervised fine-tuning (SFT) stage, we only supervise the output text. In the reinforcement fine-tuning (RFT) stage, we implement supervision using a designed Time-Decayed Reward (TDR) mechanism.
  • Figure 4: Qualitative Results of VLN-R1. As shown, VLN-R1 accepts egocentric video input and navigates through a continuous environment to ultimately reach the target location.
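
The Long-Short Memory approach mentioned for Figure 3 can be sketched as a frame-index sampler. This page does not specify the sampling rule, so the scheme below (keep the most recent frames densely, subsample the older history uniformly) and the names `long_short_sample`, `short_k`, and `long_k` are assumptions for illustration, not the authors' implementation.

```python
from typing import List


def long_short_sample(num_frames: int, short_k: int = 4, long_k: int = 8) -> List[int]:
    """Return frame indices combining sparse long-term and dense short-term memory.

    Assumed scheme: the last `short_k` frames are all kept, and up to
    `long_k` frames are drawn at uniform spacing from everything before them.
    """
    if num_frames <= 0:
        return []
    short_start = max(0, num_frames - short_k)
    short = list(range(short_start, num_frames))
    history = short_start  # number of frames older than the short window
    if history == 0 or long_k <= 0:
        return short
    if history <= long_k:
        long = list(range(history))
    else:
        step = history / long_k
        long = [int(i * step) for i in range(long_k)]
    return long + short


# Usage: 20 observed frames, 4 long-term samples plus the 4 most recent frames.
indices = long_short_sample(20, short_k=4, long_k=4)
```

The returned indices are in chronological order, so they can be used directly to slice an egocentric frame buffer before passing it to the model.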