Let's Reward Step-by-Step: Step-Aware Contrastive Alignment for Vision-Language Navigation in Continuous Environments

Haoyuan Li, Rui Liu, Hehe Fan, Yi Yang

TL;DR

This paper introduces Step-Aware Contrastive Alignment (SACA), a framework that extracts dense supervision from imperfect trajectories and achieves state-of-the-art performance on VLN-CE benchmarks.

Abstract

Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to learn complex reasoning from long-horizon human interactions. While Multi-modal Large Language Models (MLLMs) have driven recent progress, current training paradigms struggle to balance generalization capability, error recovery, and training stability. Specifically, (i) policies derived from supervised fine-tuning (SFT) suffer from compounding errors and struggle to recover from out-of-distribution states, and (ii) Reinforcement Fine-Tuning (RFT) methods such as GRPO are bottlenecked by sparse outcome rewards: their binary feedback fails to assign credit to individual steps, leading to gradient signal collapse in failure-dominant batches. To address these challenges, we introduce Step-Aware Contrastive Alignment (SACA), a framework designed to extract dense supervision from imperfect trajectories. At its core, the Perception-Grounded Step-Aware (PGSA) auditor evaluates progress step by step, disentangling failed trajectories into valid prefixes and exact divergence points. Leveraging these signals, a Scenario-Conditioned Group Construction mechanism dynamically routes batches to specialized resampling and optimization strategies. Extensive experiments on VLN-CE benchmarks demonstrate that SACA achieves state-of-the-art performance.
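The gradient signal collapse mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used GRPO-style advantage, in which each rollout's reward is standardized against the mean and standard deviation of its own group; the function name, the `eps` constant, and the example rewards are illustrative and not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one rollout group (GRPO-style, assumed form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Binary outcome rewards: 1.0 = episode succeeded, 0.0 = episode failed.
mixed_group = [1.0, 0.0, 0.0, 0.0]        # one success among failures
all_failure_group = [0.0, 0.0, 0.0, 0.0]  # failure-dominant extreme

mixed_adv = group_relative_advantages(mixed_group)
failed_adv = group_relative_advantages(all_failure_group)

print("mixed group:      ", mixed_adv)
print("all-failure group:", failed_adv)
```

In the all-failure group every advantage is zero, so the batch contributes no policy gradient even if some rollouts followed most of the instruction correctly. This is the gap that step-level supervision is meant to fill.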

Paper Structure (21 sections, 13 equations, 10 figures, 7 tables, 2 algorithms)

Figures (10)

  • Figure 1: (a) Previous work discards the entire trajectory upon failure due to compounding errors and sparse rewards. (b) SACA uses the PGSA auditor to pinpoint the exact divergence point, then uses Repair Resampling to recover from near-miss trajectories.
  • Figure 2: Overview of the proposed SACA framework. The PGSA auditor evaluates $K$ trajectories against instruction landmarks, yielding a Soft Score for ranking and a Hard Mask to isolate the Divergence Point. Based on batch outcomes, a Scenario-Conditioned mechanism dynamically routes to either Repair Resampling (for mixed groups) or All-Failure Rescue (for null-outcome groups), followed by robust optimization.
  • Figure 3: Illustration of the Repair Resampling process. (a) Extracting the structural mask $M_t$ via the PGSA auditor. (b) Backtracking to the Divergence Point $t_{div}$ to prune the erroneous suffix. (c) Resampling a corrective path from $t_{div}$, thereby salvaging the valid prefix and providing robust step-level supervision.
  • Figure 4: Qualitative comparison of navigation trajectories. (a) Egocentric observations. (b) Top-down maps. Green and red frames denote correct and failed trajectories, respectively.
  • Figure 5: Illustration of SR curves during RFT training.
  • ...and 5 more figures
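
The routing and Repair Resampling steps described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the hard mask is assumed to be a per-step 0/1 list where 1 marks a valid step, the divergence point is taken as the first invalid step, and `route_group`, `find_divergence_point`, `repair_resample`, and the `resample_fn` callback are hypothetical names. The captions do not say how an all-success group is handled, so the `"standard_update"` branch is a placeholder.

```python
from typing import Callable, List, Sequence


def route_group(successes: Sequence[bool]) -> str:
    """Pick a strategy from the outcomes of one rollout group (assumed rule)."""
    if not any(successes):
        return "all_failure_rescue"   # null-outcome group
    if not all(successes):
        return "repair_resampling"    # mixed group
    return "standard_update"          # placeholder: not specified in the captions


def find_divergence_point(hard_mask: Sequence[int]) -> int:
    """Return the index of the first invalid step, or len(mask) if none."""
    for t, valid in enumerate(hard_mask):
        if not valid:
            return t
    return len(hard_mask)


def repair_resample(
    actions: Sequence[str],
    hard_mask: Sequence[int],
    resample_fn: Callable[[List[str]], List[str]],
) -> List[str]:
    """Keep the valid prefix, prune the suffix, and resample from t_div."""
    t_div = find_divergence_point(hard_mask)
    prefix = list(actions[:t_div])
    # resample_fn stands in for the policy continuing from the prefix state.
    return prefix + list(resample_fn(prefix))


# Toy failed trajectory that goes wrong at step 2.
actions = ["forward", "left", "forward", "right", "stop"]
hard_mask = [1, 1, 0, 0, 0]

strategy = route_group([True, False, False, False])
t_div = find_divergence_point(hard_mask)
repaired = repair_resample(
    actions, hard_mask, lambda prefix: ["right", "forward", "stop"]
)
print(strategy, t_div, repaired)
```

The point of the sketch is that a failed rollout still yields a reusable prefix (`actions[:t_div]`), so supervision can be attached to the step where the trajectory diverged rather than to the episode outcome alone.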