Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation

Shuo Wang, Yucheng Wang, Guoxin Lian, Yongcai Wang, Maiyue Chen, Kaihui Wang, Bo Zhang, Zhizhong Su, Yutian Zhou, Wanting Li, Deying Li, Zhaoxin Fan

TL;DR

Progress-Think tackles long-horizon Vision-Language Navigation with semantic progress reasoning, which explicitly tracks instruction progress from accumulated visual observations. An annotation-free three-stage pipeline (Self-Aligned Progress Pretraining, Progress-Guided Policy Pretraining, and Progress-Policy Co-Finetuning) learns progress representations and aligns actions with the remaining instruction semantics. The approach achieves state-of-the-art results on R2R-CE and RxR-CE without external data, improving success rate (SR), success weighted by path length (SPL), and interpretability while maintaining a reasonable computational footprint. By grounding navigation decisions in monotonic progress signals, Progress-Think offers a principled framework for robust, coherent long-horizon embodied reasoning, with practical implications for real-world navigation systems.

Abstract

Vision-Language Navigation requires agents to act coherently over long horizons by understanding not only local visual context but also how far they have advanced within a multi-step instruction. However, recent Vision-Language-Action models focus on direct action prediction and earlier progress methods predict numeric achievements; both overlook the monotonic co-progression property of the observation and instruction sequences. Building on this insight, Progress-Think introduces semantic progress reasoning, predicting instruction-style progress from visual observations to enable more accurate navigation. To achieve this without expensive annotations, we propose a three-stage framework. In the initial stage, Self-Aligned Progress Pretraining bootstraps a reasoning module via a novel differentiable alignment between visual history and instruction prefixes. Then, Progress-Guided Policy Pretraining injects learned progress states into the navigation context, guiding the policy toward consistent actions. Finally, Progress-Policy Co-Finetuning jointly optimizes both modules with tailored progress-aware reinforcement objectives. Experiments on R2R-CE and RxR-CE show state-of-the-art success and efficiency, demonstrating that semantic progress yields a more consistent representation of navigation advancement.
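
The monotonic co-progression property mentioned above can be illustrated with a small sketch. It assumes that each navigation step has been aligned to an instruction prefix of some length (in tokens or sub-instructions), and it penalizes any step where that prefix shrinks. The function names and the hinge form are hypothetical illustrations, not the paper's actual monotonic loss, which is not specified in this summary.

```python
def monotonic_penalty(prefix_lengths):
    """Sum of hinge penalties over consecutive steps.

    prefix_lengths[t] is the (assumed) length of the instruction prefix
    aligned to the observations seen up to step t. A step contributes a
    penalty only when the aligned prefix gets shorter than at the
    previous step.
    """
    return sum(
        max(0.0, prev - cur)
        for prev, cur in zip(prefix_lengths, prefix_lengths[1:])
    )


def is_monotonic(prefix_lengths):
    """True when the aligned prefix never shrinks over time."""
    return monotonic_penalty(prefix_lengths) == 0.0


if __name__ == "__main__":
    print(monotonic_penalty([1, 2, 2, 4]))  # prefix never shrinks
    print(monotonic_penalty([1, 3, 2, 4]))  # one regression of size 1
```

A trajectory whose aligned prefix only grows or stays the same incurs no penalty, which matches the intuition that later progress builds on earlier progress.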

Paper Structure

This paper contains 24 sections, 14 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Our key structural insight in VLN: visual observations and instruction semantics exhibit monotonic co-progression. As observations accumulate (top), the aligned instruction prefix extends monotonically over time (bottom), with later progress (red) consistently building on earlier progress (blue).
  • Figure 2: Overview of the Progress-Think framework and its annotation-free training pipeline. Compared with the vanilla Vision-Language-Action (VLA) model, Progress-Think adds a Progress Reasoning Module that infers task progress and guides action generation. The model is trained in three stages: (1) Self-Aligned Progress Pretraining of the reasoning module with $\mathcal{L}_{\mathrm{SAPP}}=\mathcal{L}_{\mathrm{prefix}}+\mathcal{L}_{\mathrm{mono}}$, (2) Progress-Guided Policy Pretraining with frozen progress reasoning and a supervised policy loss $\mathcal{L}_{\mathrm{Policy}}$, and (3) Progress-Policy Co-Finetuning, which jointly optimizes reasoning and policy through GRPO over groups of $N$ rollouts, using the objective $\mathcal{L}_{\mathrm{PPCF}}$.
  • Figure 3: Qualitative comparison of progress reasoning quality. Across two representative scenes, we compare how different models infer navigation progress from historical observations. GPT-4o and NVILA (Liu et al., 2024) often produce generic or instruction-misaligned descriptions and occasionally exhibit hallucinations, limiting their usefulness for tracking progress and making it difficult for the agent to align its behavior with the intended navigation steps. Our ablated variants (without monotonic loss or without Progress-Policy Co-Finetuning) capture partial progress but tend to be less consistent and concrete, leading to incomplete guidance. In contrast, the full Progress-Think model produces concise, instruction-style reasoning that adheres closely to the task and accurately reflects the agent’s evolving progress, enabling more coherent and reliable navigation.
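
The Figure 2 caption mentions GRPO over groups of $N$ rollouts in the third stage. As a rough illustration of the group-relative step that GRPO is generally built on, the sketch below normalizes each rollout's reward against the mean and standard deviation of its own group. The reward values, the function name, the use of the population standard deviation, and the `eps` constant are assumptions made for this example. The paper's progress-aware reward design and the exact form of $\mathcal{L}_{\mathrm{PPCF}}$ are not given in this summary and are not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group of N rollouts.

    Each rollout's advantage is its reward minus the group mean,
    divided by the group standard deviation (plus eps so that a group
    with identical rewards yields zeros instead of dividing by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for a group of N = 4 rollouts of one episode.
    group = [1.0, 0.0, 1.0, 0.0]
    print(group_relative_advantages(group))
```

Rollouts that score above their group's average receive a positive advantage and those below receive a negative one, so no separate value network is needed to provide a baseline.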