SR-Nav: Spatial Relationships Matter for Zero-shot Object Goal Navigation

Leyuan Fang, Zan Mao, Zijing Wang, Yinlong Yan

Abstract

Zero-shot object-goal navigation aims to find target objects in unseen environments using only egocentric observations. Recent methods leverage the comprehension and reasoning capabilities of foundation models to improve navigation performance. However, under poor viewpoints or weak semantic cues, foundation models often fail to reason reliably in both perception and planning, resulting in inefficient or failed navigation. We observe that inherent relationships among objects and regions encode structured scene priors, which help agents infer plausible target locations even under partial observations. Motivated by this insight, we propose Spatial Relation-aware Navigation (SR-Nav), a framework that models both observed and experience-based spatial relationships to improve perception and planning. Specifically, SR-Nav first constructs a Dynamic Spatial Relationship Graph (DSRG) that encodes target-centered spatial relationships using foundation models and is updated dynamically with real-time observations. We then introduce a Relation-aware Matching Module, which replaces naive detection with relationship matching, leveraging the diverse relationships in the DSRG to verify detections and correct errors, making visual perception more robust. Finally, we design a Dynamic Relationship Planning Module that reduces the planning search space by dynamically computing optimal paths from the current position based on the DSRG, thereby guiding planning and reducing redundant exploration. Experiments on HM3D show that our method achieves state-of-the-art performance in both success rate and navigation efficiency. The code will be publicly available at https://github.com/Mzyw-1314/SR-Nav
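
To make the idea of relationship matching concrete, the following is a minimal sketch of a target-centered relationship graph and a detection check that blends detector confidence with relational support. It is not the authors' implementation: the class names (`Relation`, `RelationGraph`), the `verify_detection` function, the blending weight `alpha`, and the acceptance threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Relation:
    """One target-centered relation, e.g. target 'bed' is 'in' 'bedroom'."""
    anchor: str             # related object or region
    predicate: str          # e.g. "near", "in", "on"
    weight: float           # assumed prior strength in [0, 1]
    observed: bool = False  # True once confirmed by an observation


@dataclass
class RelationGraph:
    """Hypothetical stand-in for a target-centered relationship graph."""
    target: str
    relations: dict = field(default_factory=dict)

    def add_prior(self, anchor, predicate, weight):
        # Experience-based relation, e.g. proposed by a language model.
        self.relations[anchor] = Relation(anchor, predicate, weight)

    def observe(self, anchor):
        # Mark a prior relation as confirmed by the current view.
        if anchor in self.relations:
            self.relations[anchor].observed = True

    def support(self):
        # Fraction of total prior weight that has been observed so far.
        total = sum(r.weight for r in self.relations.values())
        seen = sum(r.weight for r in self.relations.values() if r.observed)
        return seen / total if total else 0.0


def verify_detection(det_conf, graph, accept=0.5, alpha=0.5):
    """Blend detector confidence with relational support (illustrative rule).

    Returns (accepted, blended_score). A confident detection with no
    relational support can be rejected, and a weak detection with strong
    support can be accepted.
    """
    score = (1 - alpha) * det_conf + alpha * graph.support()
    return score >= accept, score
```

Under these assumptions, a detection with confidence 0.8 in a scene where none of the expected relations has been observed is rejected, while a detection with confidence 0.3 surrounded by all expected context is accepted. This mirrors, in a simplified form, the verify-and-correct behavior the abstract describes.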

Paper Structure (14 sections, 9 equations, 11 figures, 5 tables, 1 algorithm)

This paper contains 14 sections, 9 equations, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: (a) Our Spatial Relationship Graph (SRG) integrates experiential spatial priors and perceptual observations to enhance target perception via relational matching and guide efficient path planning through spatial reasoning. (b) Comparison of successful episodes by step count: our perception+experience approach (red) achieves higher success rates at fewer exploration steps than perception-only methods (blue), demonstrating that experiential knowledge provides effective guidance in early, observation-limited navigation.
  • Figure 2: Comparison with existing methods. (a) Geometry-based methods perceive targets via object detection and solely rely on geometric cues for planning. (b) Semantic-based methods incorporate target semantics to guide planning. (c) Our method employs spatial relationship matching for target perception and utilizes target-associated spatial relationships for planning guidance.
  • Figure 3: Compared to scene graphs, which are limited by perception and mostly target-irrelevant, our method leverages both observed and experience-based relationships, which are predictive and target-relevant, enabling more effective navigation guidance.
  • Figure 4: Overview of SR-Nav. The LLM initializes a Dynamic Spatial Relationship Graph (DSRG) for the target. At each time step, the agent collects RGB-D observations and updates the DSRG via VLM reasoning. The Relation-aware Matching Module (RAMM) corrects false-positive and false-negative detections using spatial priors. When the target is not detected, the Dynamic Relationship Planning Module (DRPM) integrates DSRG localization and relational cues to generate prompts, ranks scene regions via VLM semantic similarity, and selects the highest-scoring frontier for local navigation.
  • Figure 5: Relation-aware Matching Module (RAMM). This module refines raw detections using target-related spatial priors from the DSRG. It uses relationship matching to suppress false positives and recover potential false negatives, yielding more reliable detection results.
  • ...and 6 more figures
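
The frontier-selection step described in the Figure 4 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's planner: `select_frontier`, the `region_scores` dictionary (standing in for VLM semantic-similarity scores), and the `distance_weight` travel penalty are hypothetical and do not come from the source.

```python
import math


def select_frontier(frontiers, region_scores, agent_xy, distance_weight=0.05):
    """Pick the frontier with the best semantic score minus a travel penalty.

    frontiers:     list of (name, (x, y), region_label) tuples
    region_scores: dict mapping region_label -> score in [0, 1]; here a
                   placeholder for scores a VLM might assign to each region
    agent_xy:      current agent position (x, y)
    """
    best_name, best_score = None, -math.inf
    for name, xy, region in frontiers:
        semantic = region_scores.get(region, 0.0)  # unknown regions score 0
        travel = math.dist(agent_xy, xy)           # Euclidean distance
        score = semantic - distance_weight * travel
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Example: a nearby hallway frontier versus a farther bedroom frontier.
frontiers = [
    ("f1", (1.0, 0.0), "hallway"),
    ("f2", (4.0, 3.0), "bedroom"),
]
region_scores = {"hallway": 0.2, "bedroom": 0.9}
chosen, chosen_score = select_frontier(frontiers, region_scores, (0.0, 0.0))
```

With these example values the farther bedroom frontier is chosen, because its higher semantic score outweighs the assumed distance penalty. Subtracting a distance term is one simple way to trade off relevance against travel cost; the caption itself only states that the highest-scoring frontier is selected.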