EvolveNav: Empowering LLM-Based Vision-Language Navigation via Self-Improving Embodied Reasoning

Bingqian Lin, Yunshuang Nie, Khun Loun Zai, Ziming Wei, Mingfei Han, Rongtao Xu, Minzhe Niu, Jianhua Han, Hanwang Zhang, Liang Lin, Bokui Chen, Cewu Lu, Xiaodan Liang

TL;DR

EvolveNav tackles vision-language navigation by endowing open-source LLMs with self-improving embodied reasoning. It introduces a two-stage training pipeline: Stage 1 uses formalized CoT labels to activate navigational reasoning and accelerate inference, while Stage 2 uses self-generated CoT outputs and a self-reflective task to diversify supervision and curb overfitting. Ablation and extensive benchmark results show that formalized CoT labels plus self-reflective post-training yield robust improvements across R2R, REVERIE, CVDN, and SOON, in both task-specific and cross-task settings. The approach improves navigation accuracy and interpretability, offering a scalable path for open-source LLM-based VLN systems with strong generalization to unseen environments.

Abstract

Recent studies have shown that training open-source Large Language Models (LLMs) can unleash their reasoning ability to improve vision-language navigation (VLN) performance while mitigating the domain gap between the LLMs' training corpus and the VLN task. However, these approaches predominantly adopt straightforward input-output mapping paradigms, which makes the mapping difficult to learn and the navigational decisions unexplainable. Chain-of-Thought (CoT) training is a promising way to improve both navigational decision accuracy and interpretability, but the complexity of the navigation task makes perfect CoT labels unavailable, and pure CoT supervised fine-tuning may lead to overfitting. To address these issues, we propose EvolveNav, a novel sElf-improving embodied reasoning paradigm that realizes adaptable and generalizable navigational reasoning for boosting LLM-based vision-language Navigation. Specifically, EvolveNav involves a two-stage training process: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with curated formalized CoT labels to first activate its navigational reasoning capabilities while increasing reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance supervision diversity. A self-reflective auxiliary task is also designed to encourage the model to learn correct reasoning patterns by contrasting them with wrong ones. Experimental results under both task-specific and cross-task training paradigms demonstrate the consistent superiority of EvolveNav over previous LLM-based VLN approaches on popular benchmarks, including R2R, REVERIE, CVDN, and SOON. Code is available at https://github.com/expectorlin/EvolveNav.
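
The Stage 2 idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `StepSample` fields, the "Landmark/Direction" template, the rule of accepting a self-generated reasoning output only when it led to the ground-truth action, and the way (R+, R-) pairs are formed are all assumptions made here for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class StepSample:
    """One navigation step. All field names are hypothetical."""
    original_cot: str      # pre-built formalized CoT label (Stage 1 supervision)
    generated_cot: str     # the model's own reasoning output at this step
    predicted_action: int  # candidate index chosen after that reasoning
    gt_action: int         # ground-truth candidate index


def formalized_cot(landmark: str, direction: str) -> str:
    """Assumed template: the landmark to locate plus its direction."""
    return f"Landmark: {landmark}. Direction: {direction}."


def enrich_label(sample: StepSample) -> str:
    """Pick the CoT label for the next training iteration.

    Assumption: a self-generated reasoning output replaces the original
    label only when it led to the ground-truth action.
    """
    if sample.predicted_action == sample.gt_action:
        return sample.generated_cot
    return sample.original_cot


def reflective_pair(sample: StepSample) -> Optional[Tuple[str, str]]:
    """Build an (R+, R-) pair for a self-reflective auxiliary task.

    Assumption: reasoning that led to a wrong action is the negative sample
    and the original label is the positive one. Returns None otherwise.
    """
    wrong_action = sample.predicted_action != sample.gt_action
    if wrong_action and sample.generated_cot != sample.original_cot:
        return (sample.original_cot, sample.generated_cot)
    return None


if __name__ == "__main__":
    good = StepSample(formalized_cot("sofa", "left"),
                      formalized_cot("lamp", "left"), 2, 2)
    bad = StepSample(formalized_cot("stairs", "front"),
                     formalized_cot("door", "right"), 1, 3)
    print(enrich_label(good))    # self-generated label is kept
    print(enrich_label(bad))     # falls back to the original label
    print(reflective_pair(bad))  # (positive, negative) reasoning pair
```

In this sketch, the enriched labels would feed the next round of CoT fine-tuning, while the pairs would supervise a task that asks the model to tell correct reasoning from wrong reasoning. How the paper actually selects labels and constructs negatives may differ.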

Paper Structure

This paper contains 21 sections, 8 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Comparison of different chain-of-thought (CoT) training paradigms. (a) Direct Mapping Learning maps the navigation inputs directly to actions. (b) Formalized CoT Learning and (c) Free-form CoT Learning generate formalized and free-form reasoning, respectively, when trained with fixed CoT labels. (d) In contrast to the above paradigms, our Self-Improving CoT Learning framework uses the model's own reasoning outputs as self-enriched CoT labels and learns the reasoning in a self-reflective way during CoT training to achieve generalizable and adaptable reasoning. Red and green text represents wrong and correct reasoning outputs, respectively. R+ and R- represent positive and negative reasoning samples, respectively.
  • Figure 2: Overview of EvolveNav. EvolveNav uses a two-stage training framework to achieve self-improving embodied reasoning. In Stage 1, Formalized CoT Supervised Fine-Tuning, the navigation agent is trained with pre-constructed formalized CoT labels to generate navigational reasoning by predicting the landmark to locate and its corresponding direction. In Stage 2, Self-Reflective Post-Training, the agent's own reasoning outputs are introduced as self-enriched CoT labels to enhance supervision diversity. A self-reflective auxiliary task is also designed to guide the navigation agent to discriminate between correct and wrong reasoning outputs.
  • Figure 3: Action decision visualization of NaviLLM [zheng2024towards] and our EvolveNav. For simplicity, we show only two steps and the local candidate space. Observations selected by EvolveNav (which are also the ground-truth actions) and by NaviLLM are marked with green and red boxes, respectively.
  • Figure 4: Visual comparison between self-enriched chain-of-thought (CoT) labels and the originally built CoT labels. Newly introduced landmarks in the self-enriched CoT label are highlighted in red. GT action denotes the ground-truth action (observation). We omit the direction information in the CoT labels.
  • Figure 5: Loss and performance variation during Stage 2: Self-Reflective Post-Training. A lower navigation error (NE) indicates better results.
  • ...and 1 more figure