Constraint-Aware Zero-Shot Vision-Language Navigation in Continuous Environments

Kehan Chen, Dong An, Yan Huang, Rongtao Xu, Yifei Su, Yonggen Ling, Ian Reid, Liang Wang

TL;DR

This work tackles zero-shot Vision-Language Navigation in Continuous Environments (VLN-CE) by introducing CA-Nav, a training-free framework that operates on egocentric observations. CA-Nav splits instructions into sub-instructions and grounds each via a Constraint-Aware Sub-instruction Manager (CSM) and a Constraint-Aware Value Mapper (CVM), which together enable constraint-guided progress and a stabilized value map for waypoint planning. The approach achieves state-of-the-art success rates on R2R-CE and RxR-CE validation unseen splits and demonstrates practical viability in real-world indoor robots, while offering substantial efficiency gains over prior zero-shot methods. These findings highlight the potential of constraint-aware grounding and value-map optimization to bridge simulated VLN-CE benchmarks and real-world embodied navigation, especially in open-vocabulary and unannotated settings.

Abstract

We address the task of Vision-Language Navigation in Continuous Environments (VLN-CE) under the zero-shot setting. Zero-shot VLN-CE is particularly challenging because there are no expert demonstrations for training and only minimal structural priors about the environment to guide navigation. To confront these challenges, we propose a Constraint-Aware Navigator (CA-Nav), which reframes zero-shot VLN-CE as a sequential, constraint-aware sub-instruction completion process. CA-Nav continuously translates sub-instructions into navigation plans using two core modules: the Constraint-Aware Sub-instruction Manager (CSM) and the Constraint-Aware Value Mapper (CVM). CSM defines the completion criteria for decomposed sub-instructions as constraints and tracks navigation progress by switching sub-instructions in a constraint-aware manner. CVM, guided by CSM's constraints, generates a value map on the fly and refines it using superpixel clustering to improve navigation stability. CA-Nav achieves state-of-the-art performance on two VLN-CE benchmarks, surpassing the previous best method by 12 percent and 13 percent in Success Rate on the validation unseen splits of R2R-CE and RxR-CE, respectively. Moreover, CA-Nav demonstrates its effectiveness in real-world robot deployments across various indoor scenes and instructions.
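The constraint-aware switching attributed to CSM can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `Constraint`, `SubInstruction`, `SubInstructionManager`, `object_seen`, and `in_room` are hypothetical, the observation is assumed to be a plain dictionary, and the switching rule (advance once every constraint of the current sub-instruction is satisfied) is a simplification chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical observation format: a dict produced by some perception stack.
Observation = Dict[str, object]


@dataclass
class Constraint:
    """One completion criterion for a sub-instruction (illustrative only)."""
    kind: str                              # "object", "location", or "direction"
    target: str                            # e.g. "sofa", "kitchen", "left"
    check: Callable[[Observation], bool]   # True when the criterion is met


@dataclass
class SubInstruction:
    text: str
    constraints: List[Constraint] = field(default_factory=list)


class SubInstructionManager:
    """Tracks progress by switching sub-instructions when constraints hold."""

    def __init__(self, subs: List[SubInstruction]):
        self.subs = subs
        self.index = 0

    @property
    def done(self) -> bool:
        return self.index >= len(self.subs)

    def current(self) -> SubInstruction:
        return self.subs[self.index]

    def step(self, obs: Observation) -> bool:
        """Advance if all current constraints are satisfied (assumed rule).

        Returns True when a switch happened on this step.
        """
        if self.done:
            return False
        if all(c.check(obs) for c in self.current().constraints):
            self.index += 1
            return True
        return False


def object_seen(name: str) -> Callable[[Observation], bool]:
    # Assumes the observation lists detected object names under "objects".
    return lambda obs: name in obs.get("objects", ())


def in_room(name: str) -> Callable[[Observation], bool]:
    # Assumes the observation carries a scene label under "room".
    return lambda obs: obs.get("room") == name


manager = SubInstructionManager([
    SubInstruction("walk past the sofa",
                   [Constraint("object", "sofa", object_seen("sofa"))]),
    SubInstruction("enter the kitchen",
                   [Constraint("location", "kitchen", in_room("kitchen"))]),
])
```

Calling `manager.step(obs)` once per navigation step keeps the active sub-instruction in sync with what the agent has observed. A real system would need a policy for constraints that are never satisfied, which this sketch leaves out.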

Paper Structure

This paper contains 16 sections, 8 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: The zero-shot VLN-CE task (bottom right) addresses the dual challenge of prior scarcity along two orthogonal dimensions: the absence of expert demonstrations for training and limited structural prior to guide navigation.
  • Figure 2: Illustration of the proposed CA-Nav. (a) The Constraint-aware Sub-instruction Manager (CSM) decomposes the instruction into a sequence of sub-instructions and identifies object, location, and direction constraints for each of them. (b) During navigation, a Constraint-aware Value Mapper (CVM) builds a value map based on the landmark prompt provided by CSM and segments it into regions using superpixel clustering. CA-Nav switches sub-instructions in a constraint-aware manner and chooses the geometric center of the most promising region as the waypoint.
  • Figure 3: The overall pipeline of CA-Nav. The details of the Constraint-aware Value Map Generation are shown in Figure 4.
  • Figure 4: Details of the Constraint-aware Value Map Generation.
  • Figure 5: Comparison between CA-Nav and NavGPT-CE.
  • ...and 9 more figures
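
The waypoint choice described in the Figure 2 caption (segment the value map into regions, then take the geometric center of the most promising region) can be sketched as follows. This is a minimal illustration under assumptions, not the paper's code: the function name `select_waypoint` is hypothetical, the region label map is assumed to come from any superpixel method and is passed in rather than computed, "most promising" is taken to mean the highest mean value, and the geometric center is taken to be the mean of the region's cell coordinates.

```python
from collections import defaultdict
from typing import List, Tuple


def select_waypoint(value_map: List[List[float]],
                    labels: List[List[int]]) -> Tuple[float, float]:
    """Return the (row, col) centroid of the region with the highest mean value.

    value_map and labels are same-shaped 2D grids; labels[r][c] is the id of
    the region (e.g. a superpixel) that cell (r, c) belongs to.
    """
    value_sum = defaultdict(float)   # region id -> sum of values
    row_sum = defaultdict(float)     # region id -> sum of row indices
    col_sum = defaultdict(float)     # region id -> sum of column indices
    count = defaultdict(int)         # region id -> number of cells

    for r, (value_row, label_row) in enumerate(zip(value_map, labels)):
        for c, (value, region) in enumerate(zip(value_row, label_row)):
            value_sum[region] += value
            row_sum[region] += r
            col_sum[region] += c
            count[region] += 1

    if not count:
        raise ValueError("value map is empty")

    # Assumed scoring rule: rank regions by their mean value.
    best = max(count, key=lambda region: value_sum[region] / count[region])
    return (row_sum[best] / count[best], col_sum[best] / count[best])


# Toy example: region 1 (right column) has the higher mean value.
toy_values = [[0.1, 0.1, 0.9],
              [0.1, 0.1, 0.9]]
toy_labels = [[0, 0, 1],
              [0, 0, 1]]
waypoint = select_waypoint(toy_values, toy_labels)  # (0.5, 2.0)
```

Averaging values over a region rather than picking the single highest cell is one plausible reason a region-based choice would be steadier from step to step, since one noisy cell has less influence on the selected waypoint.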