NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning

Bingqian Lin; Yunshuang Nie; Ziming Wei; Jiaqi Chen; Shikui Ma; Jianhua Han; Hang Xu; Xiaojun Chang; Xiaodan Liang

TL;DR

Vision-and-Language Navigation (VLN) requires long-horizon reasoning to follow natural-language instructions in 3D environments. The authors propose NavCoT, a training framework that turns an LLM into both a world model and a disentangled navigational reasoning agent: at each step the LLM generates a Future Imagination (FI), a Visual Information Filter (VIF), and an Action Prediction, and it is trained in-domain on formalized chain-of-thought (CoT) labels. With parameter-efficient finetuning of open-source LLaMA-based backbones, NavCoT achieves significant gains over direct action prediction and zero-shot approaches, including surpassing a GPT-4-based baseline on R2R by ~7 points in SR/SPL. The method generalizes to RxR and REVERIE and remains effective with low-resource training data. Ablations confirm that FI, VIF, and the full three-component CoT are all necessary, and the explicit reasoning improves the interpretability and scalability of LLM-based embodied agents. Code is released to support broader adoption and replication of task-adaptive reasoning for VLN.

Abstract

Vision-and-Language Navigation (VLN), a crucial research problem in Embodied AI, requires an embodied agent to navigate through complex 3D environments by following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN for improving navigational reasoning accuracy and interpretability. However, their predominant use in an offline manner usually suffers from a substantial domain gap between the VLN task and the LLM training corpus. This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), in which we perform parameter-efficient in-domain training to enable self-guided navigational decisions, significantly mitigating the domain gap in a cost-effective manner. Specifically, at each timestep, the LLM is prompted to forecast the navigational chain-of-thought by: 1) acting as a world model to imagine the next observation according to the instruction, 2) selecting the candidate observation that best aligns with the imagination, and 3) determining the action based on the reasoning from the prior steps. By constructing formalized labels for training, the LLM can learn to generate desired and reasonable chain-of-thought outputs that improve the action decision. Experimental results across various training settings and popular VLN benchmarks (e.g., Room-to-Room (R2R), Room-across-Room (RxR), Room-for-Room (R4R)) show the significant superiority of NavCoT over the direct action prediction variants. Through simple parameter-efficient finetuning, our NavCoT outperforms a recent GPT4-based approach with ~7% relative improvement on the R2R dataset. We believe that NavCoT will help unlock more task-adaptive and scalable LLM-based embodied agents, which are helpful for developing real-world robotics applications. Code is available at https://github.com/expectorlin/NavCoT.
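The three-step chain-of-thought described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_prompt` and `parse_cot`, the prompt wording, and the `Imagination: ... Filtered observation: ... Action: ...` output format are all assumptions made for illustration. The sketch only shows how a textual prompt could be assembled from the instruction, history, and candidate observations, and how a generated chain-of-thought could be split into its three parts.

```python
import re
from typing import Dict, Optional

def build_prompt(instruction: str, history: str, candidates: Dict[str, str]) -> str:
    """Assemble a text prompt (hypothetical wording) from the navigation inputs."""
    lines = [
        f"Instruction: {instruction}",
        f"History: {history}",
        "Observation:",
    ]
    # One line per candidate observation, labelled A, B, C, ...
    lines += [f"  {label}. {desc}" for label, desc in candidates.items()]
    lines.append("Answer as 'Imagination: ... Filtered observation: ... Action: ...'")
    return "\n".join(lines)

# Assumed output format: three labelled fields in a fixed order.
_COT_PATTERN = re.compile(
    r"Imagination:\s*(?P<fi>.+?)\s*"
    r"Filtered observation:\s*(?P<vif>.+?)\s*"
    r"Action:\s*(?P<ap>[A-Za-z]+)",
    re.DOTALL,
)

def parse_cot(output: str, candidates: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Split a generated chain-of-thought into its three parts.

    Returns None when a field is missing or the action is neither a
    candidate label nor 'stop'.
    """
    match = _COT_PATTERN.search(output)
    if match is None:
        return None
    action = match.group("ap")
    if action not in candidates and action.lower() != "stop":
        return None
    return {
        "imagination": match.group("fi"),
        "filter": match.group("vif"),
        "action": action,
    }

if __name__ == "__main__":
    cands = {"A": "a hallway", "B": "a kitchen", "C": "a patio"}
    print(build_prompt("go through the sliding glass door to the patio", "glass door", cands))
    print(parse_cot(
        "Imagination: patio. Filtered observation: C matches the imagination. Action: C",
        cands,
    ))
```

Rejecting outputs whose action is not among the candidate labels is one simple way to guard against malformed generations. Whether the released code does something similar is not stated in this summary.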

Paper Structure (22 sections, 8 equations, 9 figures, 6 tables)

Introduction
Related Work
Vision-Language Navigation
LLMs for Embodied AI
Chain-of-Thought Prompting
Preliminaries
Problem Setup
Large Language Models (LLMs)
Method
Vision-to-Text System
Navigational Chain-of-Thought Prompt
Reasoning Ground-Truth Collection
In-domain Chain-of-Thought Training
Experiments
Experimental Setup
...and 7 more sections

Figures (9)

Figure 1: Comparison between direct action decision and our NavCoT. According to the instruction (finding the patio after the sliding glass door) and the history (glass door), NavCoT successfully predicts the future imagination "patio", selects observation C as the best match for the imagination, and determines the correct action.
Figure 2: Overview of NavCoT. At timestep $t$, we employ a VLM to translate the observation information into a textual description. Then, the LLM is prompted with the example and the textually represented navigation input to produce the navigational chain-of-thought. We conduct in-domain training to enable the LLM to learn to generate reasonable navigational reasoning for action decisions.
Figure 3: Failure cases of LLM output in the zero-shot manner. The ground-truth actions are denoted by red boxes.
Figure 4: Comparison of NavCoT with the Direct Action Prediction (DAP) variant under different training settings. In DAP, we directly prompt the LLM to generate the action prediction.
Figure 5: Visualization examples of the Imagination ground truth (GT). We do not show the imagination GT for the final step, which is "stop".
...and 4 more figures
