Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents

Tianyi Ma, Yue Zhang, Zehao Wang, Parisa Kordjamshidi

TL;DR

SkillNav is a modular Vision-and-Language Navigation framework that decomposes navigation into interpretable atomic skills, each handled by a dedicated agent trained on synthetic skill-specific data. A Vision-Language Model (VLM)-based router dynamically selects the appropriate skill at each step, guided by a temporal reordering module that converts instructions into sub-goals. The approach improves generalization to unseen environments and instruction styles, achieving strong results on R2R and state-of-the-art generalization on GSA-R2R, supported by detailed ablations and an efficiency analysis. The work highlights the value of compositional, grounded reasoning over end-to-end methods for embodied navigation.

Abstract

Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions and navigate complex 3D environments, which poses significant challenges. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. To support targeted skill training without manual data annotation, we construct a synthetic dataset pipeline that generates diverse, linguistically natural, skill-specific instruction-trajectory pairs. We then introduce a novel training-free Vision-Language Model (VLM)-based router, which dynamically selects the most suitable agent at each time step by aligning sub-goals with visual observations and historical actions. SkillNav obtains competitive results on commonly used benchmarks and establishes state-of-the-art generalization on GSA-R2R, a benchmark with novel instruction styles and unseen environments.
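
The routing loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The skill names come from the abstract's examples only (the full taxonomy is not listed here). `keyword_router`, `make_agent`, `navigate`, and `StepContext` are hypothetical names, and the keyword matcher is a toy stand-in for the VLM-based router. The sketch also assumes one sub-goal per time step, whereas the actual router localizes the current sub-goal from observations and history.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Skill names taken from the abstract's examples; the full taxonomy is not shown here.
SKILLS = ["Vertical Movement", "Area and Region Identification", "Stop and Pause"]


@dataclass
class StepContext:
    subgoals: List[str]   # ordered sub-goals (assumed output of temporal reordering)
    observation: str      # text stand-in for the visual observation at this step
    history: List[str]    # actions taken so far


def keyword_router(ctx: StepContext, subgoal_index: int) -> str:
    """Toy stand-in for the VLM router: choose a skill from sub-goal keywords."""
    subgoal = ctx.subgoals[subgoal_index].lower()
    if any(word in subgoal for word in ("stairs", "upstairs", "downstairs", "floor")):
        return "Vertical Movement"
    if any(word in subgoal for word in ("stop", "wait", "pause")):
        return "Stop and Pause"
    return "Area and Region Identification"


def make_agent(skill: str) -> Callable[[StepContext, str], str]:
    """Hypothetical skill agent: returns a labeled action string instead of moving."""
    def act(ctx: StepContext, subgoal: str) -> str:
        return f"[{skill}] {subgoal}"
    return act


def navigate(
    subgoals: List[str],
    observations: List[str],
    router: Callable[[StepContext, int], str],
    agents: Dict[str, Callable[[StepContext, str], str]],
) -> List[str]:
    """At each time step, route to one skill agent and record its action."""
    history: List[str] = []
    for step, observation in enumerate(observations):
        # Simplifying assumption: step i focuses on sub-goal i (clamped to the last one).
        index = min(step, len(subgoals) - 1)
        ctx = StepContext(subgoals, observation, history)
        skill = router(ctx, index)
        history.append(agents[skill](ctx, subgoals[index]))
    return history


agents = {skill: make_agent(skill) for skill in SKILLS}
subgoals = ["Go up the stairs", "Enter the bedroom", "Stop next to the bed"]
observations = ["staircase ahead", "hallway with doors", "bed on the left"]
trace = navigate(subgoals, observations, keyword_router, agents)
for line in trace:
    print(line)
```

Keeping the router a plain function of the current sub-goal, observation, and history mirrors the training-free design: swapping the keyword matcher for a VLM call would not require changing the skill agents.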

Paper Structure

This paper contains 37 sections, 6 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: SkillNav decomposes complex navigation instructions into atomic skills, which can be flexibly recomposed to address new environments.
  • Figure 2: SkillNav Architecture. SkillNav takes visual observations, original instructions and the topological map as input. A temporal reordering module first leverages an LLM to reorder instructions into structured action goals. Subsequently, a VLM-based action router localizes the current focused sub-goal and dynamically selects the most suitable skill-based agent. For each skill, we construct specialized instruction-visual observation datasets for targeted skill learning.
  • Figure 3: Qualitative examples of routing and navigation results. These examples include cases where the instruction is temporally complex, colloquial, or spatially ambiguous.
  • Figure 4: Distribution of instructions in the R2R dataset categorized by the proposed skill taxonomy.
  • Figure 5: Path-length statistics of our synthetic datasets compared with existing VLN datasets. R2R, ScaleVLN, SRDF, and our 6 skill-specific datasets are used for training, while GSA-R2R is used only for evaluation.