VISOR: VIsual Spatial Object Reasoning for Language-driven Object Navigation

Francesco Taioli, Shiping Yang, Sonia Raychaudhuri, Marco Cristani, Unnat Jain, Angel X Chang

TL;DR

VISOR tackles language-driven object navigation by integrating perception, reasoning, and action into a compact (3B) Vision–Language–Action model that directly grounds decisions in visual input. It replaces brittle embedding matching with explicit image-grounded reasoning, producing <think>, <think_summary>, and <action> outputs at each step and using a panoramic view plus an online top-down map to enhance spatial understanding. A novel Waypoint Selection Bench supports supervised fine-tuning for reasoning-enabled navigation, and GSPO-based RL post-training improves navigation efficiency and generalization to unseen environments, while preserving explainability through reasoning traces. The work demonstrates improved robustness to distribution shifts and highlights practical paths toward scalable, interpretable embodied agents that can operate with a single unified model rather than multi-model pipelines.

Abstract

Language-driven object navigation requires agents to interpret natural language descriptions of target objects, which combine intrinsic and extrinsic attributes for instance recognition and commonsense navigation. Existing methods either (i) use end-to-end trained models with vision-language embeddings, which struggle to generalize beyond training data and lack action-level explainability, or (ii) rely on modular zero-shot pipelines with large language models (LLMs) and open-set object detectors, which suffer from error propagation, high computational cost, and difficulty integrating their reasoning back into the navigation policy. To this end, we propose a compact 3B-parameter Vision-Language-Action (VLA) agent that performs human-like embodied reasoning for both object recognition and action selection, removing the need for stitched multi-model pipelines. Instead of raw embedding matching, our agent employs explicit image-grounded reasoning to directly answer "Is this the target object?" and "Why should I take this action?" The reasoning process unfolds in three stages: "think", "think summary", and "action", yielding improved explainability, stronger generalization, and more efficient navigation. Code and dataset available upon acceptance.
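
To make the three-stage output concrete, the sketch below splits one step of model output into its "think", "think summary", and "action" parts. It is an illustration only, not the authors' code: the function name `parse_step_output`, the closing tags (`</think>` and so on), and the example text are assumptions, since the source only names the three stages.

```python
import re
from typing import Dict

# The three stages named in the abstract; the exact tag syntax
# (including closing tags) is an assumption made for this sketch.
TAGS = ("think", "think_summary", "action")


def parse_step_output(text: str) -> Dict[str, str]:
    """Extract the content of each stage from one step of model output."""
    parsed: Dict[str, str] = {}
    for tag in TAGS:
        # Non-greedy match so each block stops at its own closing tag.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        if match is None:
            raise ValueError(f"missing <{tag}> block")
        parsed[tag] = match.group(1).strip()
    return parsed


# Hypothetical output for a single navigation step.
example = (
    "<think>Waypoint B leads toward the bedroom door.</think>"
    "<think_summary>B is the most promising direction.</think_summary>"
    "<action>B</action>"
)
step = parse_step_output(example)
```

Keeping the stages in separate fields lets a downstream planner read only `step["action"]` while the reasoning text stays available for inspection.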

Paper Structure (25 sections, 5 equations, 8 figures, 3 tables)

This paper contains 25 sections, 5 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Given an instruction (e.g., "cabinet with a mirror on top of it, in a bedroom"), VISOR projects the panoramic observation into world coordinates via inverse camera projection. Waypoint candidates are extracted through a clustering mechanism (Step $1$) and superimposed on the panoramic view, serving as anchors for spatial reasoning. The reasoning trace (Step $2$) unfolds in three stages: think, think_summary, and action. VISOR then selects the most plausible waypoint label, projects it into world coordinates, and executes low-level actions via a shortest-path planner.
  • Figure 2: Reasoning capabilities of VISOR. At Step $1$, it selects a novel action. At Steps $10$ and $20$, it reasons spatially to maximize navigation efficiency. Finally, at Step $32$, it successfully stops navigating after recognizing the same objects described in instruction $\mathcal{I}$.
  • Figure 3: GPT-4o reasoning traces. The input to the model is "the wardrobe, which is located to the right of the bed. it is positioned below the plant and next to the chest of drawers. the wardrobe is described as a white dresser." The distance to the goal is $2.81$m.
  • Figure 4: Boxplot of instruction lengths (in tokens) for the $\mathcal{D}_{\text{SFT}}$ dataset. Overall, instructions have a mean length of $21.16$ tokens, requiring complex grounding.
  • Figure 5: Failure due to hallucination. The input description is "cabinet with a chalkboard on it. the cabinet is located to the right of the curtain." The hallucination is highlighted in red. In particular, label B does not guide the agent toward the kitchen; instead, it leads the agent to explore the living room area.
  • ...and 3 more figures
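
The inverse camera projection mentioned in the Figure 1 caption can be sketched as follows. This is a minimal example under stated assumptions, not the paper's implementation: it assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, a per-pixel depth map in meters, and a 4x4 camera-to-world matrix `cam_to_world`. All of these names are hypothetical, and the source does not specify how the panoramic view is parameterized.

```python
import numpy as np


def backproject_to_world(depth, fx, fy, cx, cy, cam_to_world):
    """Lift every pixel of a depth map to a 3D point in world coordinates.

    Assumes a pinhole camera model (an assumption of this sketch).
    Returns an array of shape (H * W, 3) in row-major pixel order.
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    # Invert the pinhole projection u = fx * x / z + cx (same for v).
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    # Homogeneous camera-frame points, one row per pixel.
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
    # Apply the rigid camera-to-world transform.
    pts_world = pts_cam @ cam_to_world.T
    return pts_world[:, :3]


# Toy example: 2x2 depth map at 2 m, camera shifted 1 m along world x.
depth = np.full((2, 2), 2.0)
pose = np.eye(4)
pose[0, 3] = 1.0
points = backproject_to_world(depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0, cam_to_world=pose)
```

Points obtained this way could then be grouped by any clustering method to propose waypoint candidates; the caption does not say which clustering mechanism is used.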