Nav-$R^2$ Dual-Relation Reasoning for Generalizable Open-Vocabulary Object-Goal Navigation
Wentao Xiang, Haokang Zhang, Tianhang Yang, Zedong Chu, Ruihang Chu, Shichao Xie, Yujian Yuan, Jian Sun, Zhining Gu, Junjie Wang, Xiaolong Wu, Mu Xu, Yujiu Yang
TL;DR
Nav-$R^2$ tackles open-vocabulary object-goal navigation by explicitly modeling two relations, target-environment perception and environment-action planning, within a structured Chain-of-Thought (CoT) framework paired with a Similarity-Aware Memory (SA-Mem). A dedicated CoT dataset (~300K samples) is built to teach this reasoning, and Nav-$R^2$ is trained with supervised fine-tuning on simulated RGB data, without maps or RL, achieving state-of-the-art success rates on OVON while running at ~2Hz. Key contributions are the explicit relational reasoning framework, the SA-Mem design with frame compression and fusion, and strong generalization to unseen objects, supported by the OVON results and ablations. The work points toward scalable open-world navigation with interpretable reasoning and efficient inference, and resources are to be publicly released for the community.
Abstract
Object-goal navigation in open-vocabulary settings requires agents to locate novel objects in unseen environments, yet existing approaches suffer from opaque decision-making and low success rates on unseen objects. To address these challenges, we propose Nav-$R^2$, a framework that explicitly models two critical types of relationships, target-environment modeling and environment-action planning, through structured Chain-of-Thought (CoT) reasoning coupled with a Similarity-Aware Memory (SA-Mem). We construct the Nav-$R^2$-CoT dataset, which teaches the model to perceive the environment, focus on target-related objects in the surrounding context, and finally plan future actions. SA-Mem preserves the features most relevant to the target and to the current observation, from both temporal and semantic perspectives, by compressing video frames and fusing historical observations, while introducing no additional parameters. Compared to previous methods, Nav-$R^2$ achieves state-of-the-art performance in localizing unseen objects through a streamlined and efficient pipeline, avoiding overfitting to seen object categories while maintaining real-time inference at 2Hz. Resources will be made publicly available at https://github.com/AMAP-EAI/Nav-R2.
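The abstract does not specify how SA-Mem scores, compresses, or fuses features, so the following is only a minimal sketch of the general idea of a parameter-free, similarity-based memory. The function names (`cosine`, `compress_memory`), the scoring rule (sum of cosine similarities to the target and to the current observation), and the mean-based fusion are all assumptions made for illustration, not the authors' design.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def compress_memory(history, target, current, keep=2):
    """Keep the `keep` best-scoring frames and fuse the rest into one vector.

    Hypothetical scoring rule (an assumption, not taken from the paper):
    score = similarity to the target + similarity to the current observation.
    No learned parameters are involved.
    """
    scores = [cosine(f, target) + cosine(f, current) for f in history]
    ranked = sorted(range(len(history)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(ranked[:keep])   # restore temporal order
    rest_idx = sorted(ranked[keep:])
    kept = [history[i] for i in kept_idx]
    if not rest_idx:
        return kept_idx, kept, None
    dim = len(history[0])
    # Assumed fusion: plain average of the low-scoring frames.
    fused = [sum(history[i][d] for i in rest_idx) / len(rest_idx) for d in range(dim)]
    return kept_idx, kept, fused


# Toy 2-D "features" for four past frames.
history = [[1, 0], [0, 1], [1, 1], [-1, 0]]
kept_idx, kept, fused = compress_memory(history, target=[1, 0], current=[1, 1])
print(kept_idx)  # [0, 2]
print(fused)     # [-0.5, 0.5]
```

In this toy run, frames 0 and 2 align best with both the target and the current observation and are kept as-is, while frames 1 and 3 are collapsed into a single averaged vector, so the memory shrinks from four entries to three without any trainable weights.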
