Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation

Shuo Wang; Yucheng Wang; Guoxin Lian; Yongcai Wang; Maiyue Chen; Kaihui Wang; Bo Zhang; Zhizhong Su; Yutian Zhou; Wanting Li; Deying Li; Zhaoxin Fan

TL;DR

Progress-Think tackles long-horizon Vision-Language Navigation through semantic progress reasoning, which explicitly tracks instruction progress from accumulated visual observations. An annotation-free three-stage pipeline (Self-Aligned Progress Pretraining, Progress-Guided Policy Pretraining, and Progress-Policy Co-Finetuning) learns progress representations and aligns actions with the remaining instruction semantics. The approach achieves state-of-the-art results on R2R-CE and RxR-CE without external data, improving success rate (SR), success weighted by path length (SPL), and interpretability while maintaining a reasonable computational footprint. By grounding navigation decisions in monotonic progress signals, Progress-Think offers a principled framework for robust, coherent long-horizon embodied reasoning with practical implications for real-world navigation systems.

Abstract

Vision-Language Navigation requires agents to act coherently over long horizons by understanding not only local visual context but also how far they have advanced within a multi-step instruction. However, recent Vision-Language-Action models focus on direct action prediction and earlier progress methods predict numeric achievements; both overlook the monotonic co-progression property of the observation and instruction sequences. Building on this insight, Progress-Think introduces semantic progress reasoning, predicting instruction-style progress from visual observations to enable more accurate navigation. To achieve this without expensive annotations, we propose a three-stage framework. In the initial stage, Self-Aligned Progress Pretraining bootstraps a reasoning module via a novel differentiable alignment between visual history and instruction prefixes. Then, Progress-Guided Policy Pretraining injects learned progress states into the navigation context, guiding the policy toward consistent actions. Finally, Progress-Policy Co-Finetuning jointly optimizes both modules with tailored progress-aware reinforcement objectives. Experiments on R2R-CE and RxR-CE show state-of-the-art success and efficiency, demonstrating that semantic progress yields a more consistent representation of navigation advancement.
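The monotonic co-progression property mentioned above can be illustrated with a small sketch. This is not the paper's differentiable alignment; it is a hard dynamic-programming stand-in, assuming a hypothetical similarity matrix `sim[t][k]` between the visual history at step `t` and instruction prefix `k`. The function name and the toy numbers are illustrative only.

```python
def monotonic_alignment(sim):
    """Assign one instruction-prefix index to each observation step.

    sim[t][k] is an assumed similarity score between the visual history
    up to step t and the k-th instruction prefix. The returned indices
    are non-decreasing over time (progress never moves backward) and
    maximize the summed similarity under that constraint.
    """
    T, K = len(sim), len(sim[0])
    neg_inf = float("-inf")
    best = [[neg_inf] * K for _ in range(T)]  # best[t][k]: best score ending at (t, k)
    back = [[0] * K for _ in range(T)]        # back[t][k]: prefix chosen at step t-1
    best[0] = list(sim[0])
    for t in range(1, T):
        run_best, run_arg = neg_inf, 0
        for k in range(K):
            # Running maximum over all prefixes j <= k at the previous step.
            if best[t - 1][k] > run_best:
                run_best, run_arg = best[t - 1][k], k
            best[t][k] = run_best + sim[t][k]
            back[t][k] = run_arg
    # Backtrack from the best final prefix.
    k = max(range(K), key=lambda j: best[T - 1][j])
    path = [k]
    for t in range(T - 1, 0, -1):
        k = back[t][k]
        path.append(k)
    path.reverse()
    return path


# Toy scores: per-step argmax would give [0, 1, 0, 2], which moves backward.
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.3, 0.2],
    [0.1, 0.2, 0.9],
]
print(monotonic_alignment(sim))  # [0, 1, 1, 2]
```

In the toy example, the third observation looks most like the first prefix in isolation, yet the monotonic constraint keeps the agent's estimated progress at the second prefix rather than letting it regress.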
