Nav-$R^2$ Dual-Relation Reasoning for Generalizable Open-Vocabulary Object-Goal Navigation

Wentao Xiang, Haokang Zhang, Tianhang Yang, Zedong Chu, Ruihang Chu, Shichao Xie, Yujian Yuan, Jian Sun, Zhining Gu, Junjie Wang, Xiaolong Wu, Mu Xu, Yujiu Yang

TL;DR

Nav-$R^2$ tackles open-vocabulary object-goal navigation by explicitly modeling two relational axes (target-environment perception and environment-action planning) within a structured Chain-of-Thought (CoT) framework and a Similarity-Aware Memory (SA-Mem). A dedicated CoT dataset (~300K samples) is built to train this reasoning, and Nav-$R^2$ is trained with supervised fine-tuning on simulated RGB data without maps or RL, achieving state-of-the-art success rates on OVON while running at ~2 Hz. Key contributions include the explicit relational reasoning framework, the SA-Mem design with frame compression and fusion, and strong generalization to unseen objects, supported by OVON results and ablation studies. The work has practical implications for scalable, open-world navigation with interpretable reasoning and efficient inference, and releases resources for the community.

Abstract

Object-goal navigation in open-vocabulary settings requires agents to locate novel objects in unseen environments, yet existing approaches suffer from opaque decision-making and low success rates when locating unseen objects. To address these challenges, we propose Nav-$R^2$, a framework that explicitly models two critical types of relationships, target-environment modeling and environment-action planning, through structured Chain-of-Thought (CoT) reasoning coupled with a Similarity-Aware Memory (SA-Mem). We construct a Nav-$R^2$-CoT dataset that teaches the model to perceive the environment, focus on target-related objects in the surrounding context, and finally plan future actions. SA-Mem preserves the features most relevant to the target and to the current observation, from both temporal and semantic perspectives, by compressing video frames and fusing historical observations while introducing no additional parameters. Compared to previous methods, Nav-$R^2$ achieves state-of-the-art performance in localizing unseen objects through a streamlined and efficient pipeline, avoiding overfitting to seen object categories while maintaining real-time inference at 2 Hz. Resources will be made publicly available at https://github.com/AMAP-EAI/Nav-R2.
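The abstract describes SA-Mem only at a high level, so the following is a minimal sketch of one way a parameter-free, similarity-driven memory could behave, not the paper's algorithm. It assumes each frame and the target are already embedded as plain vectors; the function names, the fusion threshold of 0.95, and the eviction rule (drop the stored frame least similar to the target, never the newest one) are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def update_memory(
    memory: Sequence[Vector],
    new_frame: Sequence[float],
    target: Sequence[float],
    capacity: int,
    fuse_threshold: float = 0.95,  # hypothetical value, not from the paper
) -> List[Vector]:
    """Return a new memory after observing new_frame.

    Assumed behaviour (illustrative only):
    1. Fusion: if new_frame is nearly identical to the latest entry,
       average the two instead of storing both.
    2. Eviction: while over capacity, drop the older entry that is
       least similar to the target; the newest entry is always kept.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")

    updated = [list(v) for v in memory]
    if updated and cosine(updated[-1], new_frame) >= fuse_threshold:
        updated[-1] = mean_vector([updated[-1], new_frame])
    else:
        updated.append(list(new_frame))

    while len(updated) > capacity:
        older_scores = [cosine(v, target) for v in updated[:-1]]
        updated.pop(older_scores.index(min(older_scores)))
    return updated
```

Both steps use only similarity scores and averaging, so no learned parameters are involved, which matches the "no additional parameters" property stated in the abstract. How the actual SA-Mem scores, compresses, and fuses frames is not specified here.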

Paper Structure

This paper contains 12 sections, 8 equations, 5 figures, 5 tables, 2 algorithms.

Figures (5)

  • Figure 1: Overview of Nav-$R^2$ during an object-goal navigation episode (target: Microwave) in an unseen indoor environment, and results on the OVON val-unseen dataset. At each step, the agent parses the current observation to extract two core relations among three key entities: the environment–task relation (e.g., linking the navigation task to relevant regions such as kitchens or hallways) and the environment–target relation (e.g., associating visible objects such as cabinets or countertops with the likelihood of containing the target).
  • Figure 2: An example showing how our CoT dataset is constructed and what the joint-thinking procedure is upon receiving a new front-view image.
  • Figure 3: Architecture of Nav-$R^2$. The right side shows the architecture, input sequence, and output sequence. The left side shows the two mechanisms applied in SA-Mem: 1) Memory Maintenance and 2) Frame Compression. These mechanisms help SA-Mem extract and maintain spatially and temporally vital features.
  • Figure 4: Visualization of a trajectory on the OVON (Yokoyama et al., 2024) val-unseen dataset during evaluation. The episode contains tens of frames in total; only the key frames are shown here.
  • Figure 5: Visualization of CoT content while Nav-$R^2$ searches for an unseen object in the OVON dataset (Yokoyama et al., 2024). Nav-$R^2$ first perceives the surroundings, then identifies target-related objects, and finally makes a plan. The blue line shows the path the agent actually takes, and the green line shows the expert trajectory stored in the dataset.