STRIDER: Navigation via Instruction-Aligned Structural Decision Space Optimization

Diqi He, Xuehao Gao, Hao Li, Junwei Han, Dingwen Zhang

TL;DR

We address zero-shot Vision-and-Language Navigation in Continuous Environments (VLN-CE) with STRIDER, a framework that optimizes the agent's decision space through instruction-aligned structural planning. STRIDER combines a Structured Waypoint Generator, which derives a layout-constrained set of waypoints from depth-based skeletons, with a Task-Alignment Regulator, which uses progress feedback to steer execution in an instruction-aware loop. On R2R-CE and RxR-CE, STRIDER yields significant gains in core metrics such as SR and nDTW, supporting the value of spatial priors and feedback-guided regulation for long-horizon, language-grounded navigation. The results also indicate that STRIDER is robust across underlying models and gives complementary benefits when applied to fine-tuned systems, which points to practical potential for zero-shot embodied navigation with scalable multimodal reasoning.

Abstract

The zero-shot Vision-and-Language Navigation in Continuous Environments (VLN-CE) task requires agents to navigate previously unseen 3D environments by following natural language instructions, without any scene-specific training. A critical challenge in this setting is keeping the agent's actions aligned with both spatial structure and task intent over long-horizon execution. Existing methods often fail to navigate robustly because they lack structured decision-making and do not sufficiently integrate feedback from previous actions. To address these challenges, we propose STRIDER (Instruction-Aligned Structural Decision Space Optimization), a novel framework that systematically optimizes the agent's decision space by integrating spatial layout priors and dynamic task feedback. Our approach introduces two key innovations: 1) a Structured Waypoint Generator that constrains the action space through spatial structure, and 2) a Task-Alignment Regulator that adjusts behavior based on task progress, ensuring semantic alignment throughout navigation. Extensive experiments on the R2R-CE and RxR-CE benchmarks demonstrate that STRIDER significantly outperforms strong state-of-the-art baselines across key metrics; in particular, it improves Success Rate (SR) from 29% to 35%, a relative gain of 20.7%. These results highlight the importance of spatially constrained decision-making and feedback-guided execution in improving navigation fidelity for zero-shot VLN-CE.
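
The 20.7% figure is a relative gain, not an absolute one: the absolute improvement is 6 percentage points. Assuming the rounded SR values quoted in the abstract (29% and 35%), the arithmetic works out as follows:

```latex
\[
\text{relative gain}
  = \frac{\mathrm{SR}_{\text{new}} - \mathrm{SR}_{\text{old}}}{\mathrm{SR}_{\text{old}}}
  = \frac{35 - 29}{29}
  = \frac{6}{29}
  \approx 0.207 = 20.7\%
\]
```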

Paper Structure

This paper contains 30 sections, 7 equations, 14 figures, 8 tables.

Figures (14)

  • Figure 1: Navigation behavior comparison between STRIDER and Open-Nav [qiao2024open]. Given the same instruction, Open-Nav demonstrates execution drift, such as prematurely turning away from a hallway, and accumulates deviations over time. In contrast, STRIDER generates trajectories that more accurately follow the intended path and reach the goal region.
  • Figure 2: Overview of the STRIDER pipeline. The Structured Waypoint Generator constructs a layout-constrained waypoint space by extracting skeleton paths from navigable depth observations. The agent performs perception and reasoning over visual descriptions and feedback to identify suitable actions in context. To maintain semantic alignment over time, the Task-Alignment Regulator compares current and previous observations and generates feedback that guides the next action.
  • Figure 3: Structured waypoint selection based on skeleton. We categorize skeleton nodes by their degree and select only degree-1 endpoints as candidate waypoints.
  • Figure 4: Comparison between original waypoint predictor and Structured Waypoint Generator. Our Structured Waypoint Generator extracts layout-consistent waypoints that align with the environment's topology, resulting in trajectories that are more goal-directed and spatially coherent.
  • Figure 5: Comparison of agent behavior under no-feedback and feedback-driven execution strategies. Without feedback, the agent prematurely infers task completion, resulting in an incorrect action (Action 2). With feedback, the agent leverages the intermediate state to refine its understanding, yielding a more semantically consistent action (Action 1) aligned with the instruction.
  • ...and 9 more figures
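
The degree-based selection described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: it assumes the skeleton is already available as a binary 2D grid, that node degree is counted over 8-connected neighbors, and the function names `skeleton_degrees` and `endpoint_waypoints` are hypothetical. How the skeleton is extracted from depth observations is not covered here.

```python
from typing import Dict, List, Sequence, Tuple

Cell = Tuple[int, int]


def skeleton_degrees(skeleton: Sequence[Sequence[int]]) -> Dict[Cell, int]:
    """Map each skeleton cell (row, col) to its number of 8-connected skeleton neighbors."""
    height = len(skeleton)
    width = len(skeleton[0]) if height else 0
    degrees: Dict[Cell, int] = {}
    for r in range(height):
        for c in range(width):
            if not skeleton[r][c]:
                continue
            count = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # skip the cell itself
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < height and 0 <= cc < width and skeleton[rr][cc]:
                        count += 1
            degrees[(r, c)] = count
    return degrees


def endpoint_waypoints(skeleton: Sequence[Sequence[int]]) -> List[Cell]:
    """Keep only degree-1 cells (branch endpoints) as candidate waypoints."""
    return sorted(cell for cell, deg in skeleton_degrees(skeleton).items() if deg == 1)


# Toy T-shaped skeleton: a horizontal corridor with one branch going down.
# (Assumed example data, not taken from the paper.)
toy_skeleton = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
]

candidates = endpoint_waypoints(toy_skeleton)
print(candidates)
```

On the toy grid, the three tips of the T are the only cells with exactly one neighbor, so they are the only candidates returned. Cells along a corridor or at a junction have a higher degree and are discarded, which is what keeps the candidate set small and tied to the layout.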