Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation

Won Shik Jang, Ue-Hwan Kim

TL;DR

Ablations show that encoding full captions into the value map avoids wasted motion and that explicit, viewpoint-aware 3D verification prevents semantically plausible but incorrect stops. This suggests that geometry-grounded spatial reasoning is a scalable alternative to heavy policy training or human-in-the-loop interaction for fine-grained instance disambiguation in cluttered 3D scenes.

Abstract

Text-goal instance navigation (TGIN) asks an agent to resolve a single, free-form description into actions that reach the correct object instance among same-category distractors. We present Context-Nav, which elevates long, contextual captions from a local matching cue to a global exploration prior and verifies candidates through 3D spatial reasoning. First, we compute dense text-image alignments for a value map that ranks frontiers, guiding exploration toward regions consistent with the entire description rather than toward early detections. Second, upon observing a candidate, we perform a viewpoint-aware relation check: the agent samples plausible observer poses, aligns local frames, and accepts a target only if the spatial relations can be satisfied from at least one viewpoint. The pipeline requires no task-specific training or fine-tuning, and we attain state-of-the-art performance on InstanceNav and CoIN-Bench. Ablations show that (i) encoding full captions into the value map avoids wasted motion and (ii) explicit, viewpoint-aware 3D verification prevents semantically plausible but incorrect stops. This suggests that geometry-grounded spatial reasoning is a scalable alternative to heavy policy training or human-in-the-loop interaction for fine-grained instance disambiguation in cluttered 3D scenes.
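
The frontier-ranking step described in the abstract can be illustrated with a minimal sketch. It assumes the value map is a 2D grid of text-image alignment scores and that frontiers are grid cells. The function name `rank_frontiers`, the window `radius`, and the choice of scoring a frontier by the maximum value in its neighborhood are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def rank_frontiers(value_map, frontiers, radius=2):
    """Sort frontier cells by descending local value.

    Hypothetical scoring: a frontier's score is the maximum value-map
    entry inside a square window centered on the frontier cell.
    """
    height, width = value_map.shape
    scored = []
    for row, col in frontiers:
        # Clip the window to the map bounds.
        r0, r1 = max(row - radius, 0), min(row + radius + 1, height)
        c0, c1 = max(col - radius, 0), min(col + radius + 1, width)
        scored.append((float(value_map[r0:r1, c0:c1].max()), (row, col)))
    # A stable sort keeps the input order for frontiers with equal scores.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [cell for _, cell in scored]

# Toy value map: one region aligns strongly with the caption, one weakly.
value_map = np.zeros((10, 10))
value_map[7, 7] = 0.9
value_map[1, 1] = 0.3
frontiers = [(1, 2), (8, 6), (4, 4)]
ranked = rank_frontiers(value_map, frontiers)
```

In this toy setup the frontier next to the high-alignment region is visited first, which is the sense in which the caption acts as a global exploration prior rather than a local matching cue.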


Paper Structure

This paper contains 15 sections, 4 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of the text-goal instance navigation task and our context-driven pipeline. Given a long description that mixes intrinsic attributes (“mainly yellow and green”) with extrinsic context (“located above the cabinet and near the staircase”), the agent explores guided by the context-driven value map and performs viewpoint-aware 3D spatial reasoning. The agent rejects early picture candidates because either the color or the nearby context objects do not match, and ultimately exploits the region containing both the cabinet and the staircase, where 3D verification confirms that all intrinsic and extrinsic constraints are satisfied.
  • Figure 2: Overall pipeline of Context-Nav. Given RGB-D observations, odometry, and a free-form text goal, the perception and mapping modules use GOAL-CLIP, open-vocabulary detection, and 3D projection to build an occupancy map, a context-conditioned value map, and an instance-level map. Whenever a target object candidate is detected, the verification module checks intrinsic attributes with a VLM and extrinsic attributes through 3D spatial reasoning to decide whether to terminate or continue exploring.
  • Figure 3: Stage-wise qualitative example of context-driven navigation. An episode in which the agent must find a dresser described as “located next to the bed” and “a white dresser with a mirror on top”. An early dresser candidate is rejected because the context objects are absent; after the bed is detected, the value map concentrates around the corresponding room, frontier selection focuses on that area, and a dresser that satisfies both the intrinsic attributes and the 3D spatial relations with the bed and mirror is finally verified as the goal.
  • Figure 4: Qualitative results across diverse categories and context descriptions. Successful episodes on CoIN-Bench for nine different target categories, showing top-down trajectories and corresponding goal views. The instructions span a wide range of natural language, from captions that only specify extrinsic context to descriptions that combine intrinsic and extrinsic attributes, and from short hints to detailed multi-sentence goals.
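
The viewpoint-aware relation check used by the verification module can be sketched as follows. This is a simplified illustration rather than the paper's implementation: it works in a 2D top-down frame, samples observer poses on a circle around the candidate, and supports only three hypothetical relation names (`left_of`, `right_of`, `near`). The function names, the `margin`, the `near_radius`, and the pose-sampling scheme are all assumptions.

```python
import math

def lateral_offset(point, observer_xy, heading):
    """Signed leftward offset of a world-frame point in the observer's frame."""
    dx = point[0] - observer_xy[0]
    dy = point[1] - observer_xy[1]
    return -math.sin(heading) * dx + math.cos(heading) * dy

def holds(relation, target, anchor, observer_xy, heading,
          margin=0.1, near_radius=1.5):
    """Evaluate one relation between target and anchor from one observer pose."""
    if relation == "near":
        # Distance does not depend on the viewpoint.
        return math.dist(target, anchor) <= near_radius
    gap = (lateral_offset(target, observer_xy, heading)
           - lateral_offset(anchor, observer_xy, heading))
    if relation == "left_of":
        return gap > margin
    if relation == "right_of":
        return gap < -margin
    raise ValueError(f"unsupported relation: {relation}")

def verify_candidate(target, constraints, n_poses=8, standoff=2.0):
    """Accept if all constraints hold together from at least one sampled pose."""
    for k in range(n_poses):
        angle = 2.0 * math.pi * k / n_poses
        observer = (target[0] + standoff * math.cos(angle),
                    target[1] + standoff * math.sin(angle))
        # The sampled observer faces the candidate.
        heading = math.atan2(target[1] - observer[1], target[0] - observer[0])
        if all(holds(rel, target, anchor, observer, heading)
               for rel, anchor in constraints):
            return True
    return False

candidate = (0.0, 0.0)
# Anchors on opposite sides: some viewpoint sees the candidate between them.
between = verify_candidate(candidate, [("left_of", (1.0, 0.0)),
                                       ("right_of", (-1.0, 0.0))])
# Anchors on the same side: no viewpoint satisfies both relations at once.
same_side = verify_candidate(candidate, [("left_of", (1.0, 0.0)),
                                         ("right_of", (2.0, 0.0))])
```

Requiring all constraints to hold from the same sampled pose is what makes the check discriminative: a single left/right relation is satisfiable from some direction for almost any pair of objects, whereas a conjunction of relations can rule out candidates that merely look plausible.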