EvolveNav: Empowering LLM-Based Vision-Language Navigation via Self-Improving Embodied Reasoning

Bingqian Lin; Yunshuang Nie; Khun Loun Zai; Ziming Wei; Mingfei Han; Rongtao Xu; Minzhe Niu; Jianhua Han; Hanwang Zhang; Liang Lin; Bokui Chen; Cewu Lu; Xiaodan Liang

TL;DR

EvolveNav tackles vision-language navigation by endowing open-source LLMs with self-improving embodied reasoning. It introduces a two-stage training pipeline: Stage 1 uses formalized CoT labels to activate navigational reasoning and speed up reasoning, while Stage 2 uses self-generated CoT outputs and a self-reflective auxiliary task to diversify supervision and curb overfitting. Ablations and benchmark results indicate that formalized CoT labels combined with self-reflective post-training yield consistent improvements on R2R, REVERIE, CVDN, and SOON, in both task-specific and cross-task settings. The approach improves navigation accuracy and interpretability for open-source LLM-based VLN systems.

Abstract

Recent studies have shown that training open-source Large Language Models (LLMs) can unlock their reasoning ability to improve vision-language navigation (VLN) performance while mitigating the domain gap between the LLMs' training corpora and the VLN task. However, these approaches predominantly adopt straightforward input-output mapping paradigms, which makes the mapping difficult to learn and the navigational decisions hard to explain. Chain-of-Thought (CoT) training is a promising way to improve both navigational decision accuracy and interpretability, but the complexity of the navigation task makes perfect CoT labels unavailable, and pure CoT supervised fine-tuning may lead to overfitting. To address these issues, we propose EvolveNav, a novel sElf-improving embodied reasoning paradigm that realizes adaptable and generalizable navigational reasoning for boosting LLM-based vision-language Navigation. Specifically, EvolveNav involves a two-stage training process: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with curated formalized CoT labels to first activate its navigational reasoning capabilities while also increasing reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance supervision diversity. A self-reflective auxiliary task is also designed to encourage the model to learn correct reasoning patterns by contrasting them with wrong ones. Experimental results under both task-specific and cross-task training paradigms demonstrate the consistent superiority of EvolveNav over previous LLM-based VLN approaches on various popular benchmarks, including R2R, REVERIE, CVDN, and SOON. Code is available at https://github.com/expectorlin/EvolveNav.
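
To make the Stage 2 idea concrete, the following is a minimal sketch of one round of label enrichment, not the authors' implementation. The abstract does not say how self-generated reasoning is selected, so the rule used here (adopt the model's reasoning only when its predicted action matches the ground truth, and otherwise pair it with the existing label as a correct/wrong example) is an assumption. The names `Sample`, `self_enrich`, and `toy_generate` are hypothetical, and the CoT strings are placeholders rather than the paper's formalized label format.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    instruction: str
    gt_action: int   # index of the ground-truth next action
    cot_label: str   # current CoT supervision (placeholder text)


# Hypothetical model interface: instruction -> (reasoning text, predicted action)
Generate = Callable[[str], Tuple[str, int]]


def self_enrich(samples: List[Sample], generate: Generate):
    """One enrichment round. Selection by action agreement is an assumption."""
    enriched: List[Sample] = []
    reflection_pairs: List[Tuple[str, str]] = []
    for s in samples:
        reasoning, action = generate(s.instruction)
        if action == s.gt_action:
            # Adopt the model's own reasoning as the new CoT label.
            enriched.append(Sample(s.instruction, s.gt_action, reasoning))
        else:
            # Keep the existing label and record a (correct, wrong) pair
            # that a self-reflective auxiliary task could contrast.
            enriched.append(s)
            reflection_pairs.append((s.cot_label, reasoning))
    return enriched, reflection_pairs


def toy_generate(instruction: str) -> Tuple[str, int]:
    # Stand-in for the LLM: picks action 0 if "left" appears, else action 1.
    action = 0 if "left" in instruction else 1
    return "Chosen action: " + str(action), action


samples = [
    Sample("turn left at the sofa", 0, "placeholder label A"),
    Sample("walk to the kitchen", 2, "placeholder label B"),
]
enriched, pairs = self_enrich(samples, toy_generate)
```

In this toy run the first sample's label is replaced by the model's own reasoning, while the second keeps its original label and contributes one contrast pair. Repeating the round with a retrained model would correspond to the iterative training the abstract describes.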
