Hydra-Nav: Object Navigation via Adaptive Dual-Process Reasoning

Zixuan Wang, Huang Fang, Shaoan Wang, Yuanfei Luo, Heng Dong, Wei Li, Yiming Gan

TL;DR

Hydra-Nav tackles the challenge of object navigation with vision-language models by unifying a slow deliberative planner and a fast reactive controller within a single VLM. It introduces a three-stage curriculum to enhance spatial-action alignment, temporal-spatial memory-based reasoning, and adaptive reasoning through Iterative Rejection Fine-Tuning (IRFT), which triggers reasoning only at stagnation points. The approach yields state-of-the-art results on HM3D, MP3D, and OVON, while introducing the SOT metric to quantify search efficiency under varying reasoning loads. Empirically, adaptive reasoning reduces compute overhead without sacrificing performance, enabling more practical deployment in real-world robotics. The work demonstrates that an end-to-end dual-process VLM can achieve superior navigation performance with substantially improved efficiency.

Abstract

While large vision-language models (VLMs) show promise for object goal navigation, current methods still struggle with low success rates and inefficient localization of unseen objects, failures primarily attributed to weak temporal-spatial reasoning. Meanwhile, recent attempts to inject reasoning into VLM-based agents improve success rates but incur substantial computational overhead. To address both the ineffectiveness and inefficiency of existing approaches, we introduce Hydra-Nav, a unified VLM architecture that adaptively switches between a deliberative slow system for analyzing exploration history and formulating high-level plans, and a reactive fast system for efficient execution. We train Hydra-Nav through a three-stage curriculum: (i) spatial-action alignment to strengthen trajectory planning, (ii) memory-reasoning integration to enhance temporal-spatial reasoning over long-horizon exploration, and (iii) iterative rejection fine-tuning to enable selective reasoning at critical decision points. Extensive experiments demonstrate that Hydra-Nav achieves state-of-the-art performance on the HM3D, MP3D, and OVON benchmarks, outperforming the second-best methods by 11.1%, 17.4%, and 21.2%, respectively. Furthermore, we introduce SOT (Success weighted by Operation Time), a new metric to measure search efficiency across VLMs with varying reasoning intensity. Results show that adaptive reasoning significantly enhances search efficiency over fixed-frequency baselines.

Paper Structure (47 sections, 6 equations, 11 figures, 5 tables, 2 algorithms)

Figures (11)

  • Figure 1: The architecture of Hydra-Nav. Hydra-Nav receives user instruction, long-term memory, and previous image-action pairs, then outputs reasoning (optionally) and meta-actions. Hydra-Nav adaptively switches between the fast and slow systems by outputting the special transition token obs. Specifically, the panoramic scan triggered by obs is extracted as a new landmark and inserted into the existing long-term memory.
  • Figure 2: Illustration of the context organization of Hydra-Nav during inference. The context starts with a system prompt containing the user instruction and long-term memory. Short-term memory is organized as interleaved image-action pairs. When a transition token is encountered, we update the memory and clear the image-action pairs. Note that each token block in the figure represents a sequence of multiple tokens.
  • Figure 3: An illustration of the data synthesis pipeline used in stage 2. The left side shows our trajectory generation strategy, where the robot visits Point 1 and Point 2 before reaching the goal. The right side illustrates how we prompt Qwen3-VL-235B-Thinking to produce high-quality reasoning traces.
  • Figure 4: Performance analysis of multi-turn IRFT across different benchmarks.
  • Figure 5: Real-world robot platform.
  • ...and 6 more figures
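
The context handling described in the captions of Figures 1 and 2 can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the class `NavContext`, the function `apply_output`, the string placeholders for images and panoramic scans, and the action names are all hypothetical, and the transition token is written as `obs` simply because that is how it appears in the Figure 1 caption.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Token name taken from the Figure 1 caption; its exact form is an assumption.
TRANSITION_TOKEN = "obs"


@dataclass
class NavContext:
    """Hypothetical container mirroring the context layout in Figure 2."""

    instruction: str
    # Landmarks extracted from panoramic scans (placeholder strings here).
    long_term_memory: List[str] = field(default_factory=list)
    # Interleaved (image, action) pairs since the last transition.
    short_term_memory: List[Tuple[str, str]] = field(default_factory=list)

    def record_step(self, image_id: str, action: str) -> None:
        """Append one image-action pair to short-term memory."""
        self.short_term_memory.append((image_id, action))

    def on_transition(self, landmark: str) -> None:
        """Insert the new landmark, then clear the image-action pairs."""
        self.long_term_memory.append(landmark)
        self.short_term_memory.clear()

    def render(self) -> List[str]:
        """Return the context as blocks: system prompt, then image/action pairs."""
        blocks = [
            f"[system] instruction={self.instruction!r} "
            f"memory={self.long_term_memory}"
        ]
        for image_id, action in self.short_term_memory:
            blocks.append(f"[image] {image_id}")
            blocks.append(f"[action] {action}")
        return blocks


def apply_output(
    ctx: NavContext, image_id: str, output: str, scan_landmark: str = ""
) -> str:
    """Update the context from one model output.

    Returns "slow" if the output was the transition token (memory was
    updated and short-term pairs were cleared), otherwise "fast".
    """
    if output == TRANSITION_TOKEN:
        ctx.on_transition(scan_landmark)
        return "slow"
    ctx.record_step(image_id, output)
    return "fast"


if __name__ == "__main__":
    ctx = NavContext(instruction="find a chair")
    apply_output(ctx, "img_0", "move_forward")
    apply_output(ctx, "img_1", "turn_left")
    apply_output(ctx, "img_2", TRANSITION_TOKEN, scan_landmark="panorama_0")
    for block in ctx.render():
        print(block)
```

In this sketch the long-term memory only grows at transitions, while the short-term image-action pairs are discarded each time. That is one plausible reading of why the context stays bounded over long explorations; the captions do not state this motivation explicitly.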