Hydra-Nav: Object Navigation via Adaptive Dual-Process Reasoning
Zixuan Wang, Huang Fang, Shaoan Wang, Yuanfei Luo, Heng Dong, Wei Li, Yiming Gan
TL;DR
Hydra-Nav tackles object navigation with vision-language models by unifying a slow deliberative planner and a fast reactive controller within a single VLM. It is trained with a three-stage curriculum that targets spatial-action alignment, memory-based temporal-spatial reasoning, and adaptive reasoning via Iterative Rejection Fine-Tuning (IRFT), which teaches the model to trigger reasoning only at stagnation points. The approach yields state-of-the-art results on HM3D, MP3D, and OVON, and the work introduces the SOT metric to quantify search efficiency under varying reasoning loads. Empirically, adaptive reasoning reduces compute overhead without sacrificing performance, making deployment on real-world robots more practical. The work demonstrates that an end-to-end dual-process VLM can improve navigation performance and search efficiency together.
Abstract
While large vision-language models (VLMs) show promise for object goal navigation, current methods still struggle with low success rates and inefficient localization of unseen objects, failures primarily attributed to weak temporal-spatial reasoning. Meanwhile, recent attempts to inject reasoning into VLM-based agents improve success rates but incur substantial computational overhead. To address both the ineffectiveness and the inefficiency of existing approaches, we introduce Hydra-Nav, a unified VLM architecture that adaptively switches between a deliberative slow system, which analyzes exploration history and formulates high-level plans, and a reactive fast system, which executes those plans efficiently. We train Hydra-Nav through a three-stage curriculum: (i) spatial-action alignment to strengthen trajectory planning, (ii) memory-reasoning integration to enhance temporal-spatial reasoning over long-horizon exploration, and (iii) iterative rejection fine-tuning to enable selective reasoning at critical decision points. Extensive experiments demonstrate that Hydra-Nav achieves state-of-the-art performance on the HM3D, MP3D, and OVON benchmarks, outperforming the second-best methods by 11.1%, 17.4%, and 21.2%, respectively. Furthermore, we introduce SOT (Success weighted by Operation Time), a new metric for measuring search efficiency across VLMs with varying reasoning intensity. Results show that adaptive reasoning significantly enhances search efficiency over fixed-frequency baselines.
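The adaptive switching described above can be illustrated with a minimal sketch. It is not the paper's implementation: in Hydra-Nav a single VLM learns when to reason, whereas here the switch is a hand-written stagnation test, and every name (`DualProcessAgent`, `slow_plan`, `fast_act`, `is_stagnant`, `repeated_observation`) is a hypothetical stand-in introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


def repeated_observation(history: List[str], window: int = 3) -> bool:
    """Hypothetical stagnation test: the last `window` observations are identical."""
    recent = history[-window:]
    return len(recent) == window and len(set(recent)) == 1


@dataclass
class DualProcessAgent:
    # All three callables are assumptions; in the paper both systems live in one VLM.
    slow_plan: Callable[[List[str]], str]     # exploration history -> high-level plan
    fast_act: Callable[[str, str], str]       # (observation, plan) -> low-level action
    is_stagnant: Callable[[List[str]], bool]  # stand-in for the learned switch
    history: List[str] = field(default_factory=list)
    plan: Optional[str] = None
    slow_calls: int = 0                       # counts expensive reasoning steps

    def step(self, observation: str) -> str:
        self.history.append(observation)
        # Invoke the slow system only at the start or when progress stalls.
        if self.plan is None or self.is_stagnant(self.history):
            self.plan = self.slow_plan(self.history)
            self.slow_calls += 1
        # The fast system acts on every step, conditioned on the current plan.
        return self.fast_act(observation, self.plan)


agent = DualProcessAgent(
    slow_plan=lambda h: f"plan-{len(h)}",
    fast_act=lambda obs, plan: f"{plan}:{obs}",
    is_stagnant=repeated_observation,
)
actions = [agent.step(obs) for obs in ["hall", "hall", "hall", "kitchen"]]
```

In this toy run the slow system is called twice over four steps (once to form the initial plan, once when three identical observations signal stagnation), while a fixed-frequency baseline that reasons on every step would call it four times.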
