Table of Contents
Fetching ...

CL-CoTNav: Closed-Loop Hierarchical Chain-of-Thought for Zero-Shot Object-Goal Navigation with Vision-Language Models

Yuxin Cai, Xiangkun He, Maonan Wang, Hongliang Guo, Wei-Yun Yau, Chen Lv

TL;DR

This work tackles zero-shot Object Navigation by fusing structured hierarchical reasoning with a closed-loop confidence mechanism in a vision-language navigation framework. It introduces CL-CoTNav, which fine-tunes a compact VLM via multi-turn QA derived from human demonstrations, enabling Hierarchical CoT prompting that separates perception from planning and improves generalization to unseen objects and scenes. A Closed-Loop H-CoT mechanism uses confidence scores to adapt training loss, mitigating noisy or hallucinated reasoning and boosting robustness; across AI Habitat MP3D settings, CL-CoTNav achieves state-of-the-art SPL improvements (up to about 22 percentage points) while maintaining strong SR, demonstrating the value of compositional semantic reasoning and uncertainty-aware learning for zero-shot navigation.

Abstract

Visual Object Goal Navigation (ObjectNav) requires a robot to locate a target object in an unseen environment using egocentric observations. However, decision-making policies often struggle to transfer to unseen environments and novel target objects, which is the core generalization problem. Traditional end-to-end learning methods exacerbate this issue, as they rely on memorizing spatial patterns rather than employing structured reasoning, limiting their ability to generalize effectively. In this letter, we introduce Closed-Loop Hierarchical Chain-of-Thought Navigation (CL-CoTNav), a vision-language model (VLM)-driven ObjectNav framework that integrates structured reasoning and closed-loop feedback into navigation decision-making. To enhance generalization, we fine-tune a VLM using multi-turn question-answering (QA) data derived from human demonstration trajectories. This structured dataset enables hierarchical Chain-of-Thought (H-CoT) prompting, systematically extracting compositional knowledge to refine perception and decision-making, inspired by the human cognitive process of locating a target object through iterative reasoning steps. Additionally, we propose a Closed-Loop H-CoT mechanism that incorporates detection and reasoning confidence scores into training. This adaptive weighting strategy guides the model to prioritize high-confidence data pairs, mitigating the impact of noisy inputs and enhancing robustness against hallucinated or incorrect reasoning. Extensive experiments in the AI Habitat environment demonstrate CL-CoTNav's superior generalization to unseen scenes and novel object categories. Our method consistently outperforms state-of-the-art approaches in navigation success rate (SR) and success weighted by path length (SPL) by 22.4\%. We release our datasets, models, and supplementary videos on our project page.

CL-CoTNav: Closed-Loop Hierarchical Chain-of-Thought for Zero-Shot Object-Goal Navigation with Vision-Language Models

TL;DR

This work tackles zero-shot Object Navigation by fusing structured hierarchical reasoning with a closed-loop confidence mechanism in a vision-language navigation framework. It introduces CL-CoTNav, which fine-tunes a compact VLM via multi-turn QA derived from human demonstrations, enabling Hierarchical CoT prompting that separates perception from planning and improves generalization to unseen objects and scenes. A Closed-Loop H-CoT mechanism uses confidence scores to adapt training loss, mitigating noisy or hallucinated reasoning and boosting robustness; across AI Habitat MP3D settings, CL-CoTNav achieves state-of-the-art SPL improvements (up to about 22 percentage points) while maintaining strong SR, demonstrating the value of compositional semantic reasoning and uncertainty-aware learning for zero-shot navigation.

Abstract

Visual Object Goal Navigation (ObjectNav) requires a robot to locate a target object in an unseen environment using egocentric observations. However, decision-making policies often struggle to transfer to unseen environments and novel target objects, which is the core generalization problem. Traditional end-to-end learning methods exacerbate this issue, as they rely on memorizing spatial patterns rather than employing structured reasoning, limiting their ability to generalize effectively. In this letter, we introduce Closed-Loop Hierarchical Chain-of-Thought Navigation (CL-CoTNav), a vision-language model (VLM)-driven ObjectNav framework that integrates structured reasoning and closed-loop feedback into navigation decision-making. To enhance generalization, we fine-tune a VLM using multi-turn question-answering (QA) data derived from human demonstration trajectories. This structured dataset enables hierarchical Chain-of-Thought (H-CoT) prompting, systematically extracting compositional knowledge to refine perception and decision-making, inspired by the human cognitive process of locating a target object through iterative reasoning steps. Additionally, we propose a Closed-Loop H-CoT mechanism that incorporates detection and reasoning confidence scores into training. This adaptive weighting strategy guides the model to prioritize high-confidence data pairs, mitigating the impact of noisy inputs and enhancing robustness against hallucinated or incorrect reasoning. Extensive experiments in the AI Habitat environment demonstrate CL-CoTNav's superior generalization to unseen scenes and novel object categories. Our method consistently outperforms state-of-the-art approaches in navigation success rate (SR) and success weighted by path length (SPL) by 22.4\%. We release our datasets, models, and supplementary videos on our project page.

Paper Structure

This paper contains 19 sections, 2 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Overview of Zero-Shot Object Navigation (ZSON). The left side illustrates the training phase, where an robot trains to navigate within seen scenes sets and towards target objects sets. The right side represents the zero-shot generalization phase, where the trained policy is evaluated in unseen target objects and novel scenes without further training. The figure also highlights the proposed H-CoT process, where it reasons about likely and unlikely object locations based on object-object and object-scene relationships.
  • Figure 2: Overview of CL-CoTNav. We finetune VLM using multi-turn QA data derived from human demonstration trajectories. This structured dataset enables H-CoT prompting, including two main turns: perception and planning, to iteratively extract compositional knowledge from egocentric RGB observations through a sequence of large pre-trained models and finally aligned with human demonstration actions. A confidence scoring system is also generated to evaluate the reliability of each detection and reasoning step, which guide adaptive loss weighting during finetuning to improve robustness against noisy supervision.
  • Figure 3: Zero-shot generalization results on MP3D Val. The figure illustrates how CL-CoTNav navigates in unseen scene layouts. The predicted navigation path is shown in blue and shortest path is shown in green. SPL = 0.71, ep_length = 97.