VISOR: VIsual Spatial Object Reasoning for Language-driven Object Navigation
Francesco Taioli, Shiping Yang, Sonia Raychaudhuri, Marco Cristani, Unnat Jain, Angel X Chang
TL;DR
VISOR tackles language-driven object navigation with a compact (3B-parameter) Vision-Language-Action model that integrates perception, reasoning, and action and grounds its decisions directly in visual input. It replaces brittle embedding matching with explicit image-grounded reasoning, producing <think>, <think_summary>, and <action> outputs at each step, and uses a panoramic view plus an online top-down map to improve spatial understanding. A new Waypoint Selection Bench supports supervised fine-tuning for reasoning-enabled navigation, and GSPO-based RL post-training improves navigation efficiency and generalization to unseen environments while preserving explainability through reasoning traces. The work reports improved robustness to distribution shifts and points toward scalable, interpretable embodied agents that run as a single unified model rather than a multi-model pipeline.
Abstract
Language-driven object navigation requires agents to interpret natural language descriptions of target objects, which combine intrinsic and extrinsic attributes to support instance recognition and commonsense navigation. Existing methods either (i) use end-to-end trained models with vision-language embeddings, which struggle to generalize beyond training data and lack action-level explainability, or (ii) rely on modular zero-shot pipelines with large language models (LLMs) and open-set object detectors, which suffer from error propagation, high computational cost, and difficulty integrating their reasoning back into the navigation policy. To address these limitations, we propose a compact 3B-parameter Vision-Language-Action (VLA) agent that performs human-like embodied reasoning for both object recognition and action selection, removing the need for stitched multi-model pipelines. Instead of raw embedding matching, our agent employs explicit image-grounded reasoning to directly answer "Is this the target object?" and "Why should I take this action?" The reasoning process unfolds in three stages: "think", "think summary", and "action", yielding improved explainability, stronger generalization, and more efficient navigation. Code and dataset will be made available upon acceptance.
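
To make the three-stage output concrete, the following is a minimal sketch of how a single step of such a response might be split into its parts. It is not the authors' implementation: the tag names come from the TL;DR, but the closing-tag syntax, the `StepOutput` and `parse_step` names, and the example action value `waypoint_2` are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Tag names follow the TL;DR; the closing-tag grammar is an assumption.
_TAGS = ("think", "think_summary", "action")


@dataclass
class StepOutput:
    """One navigation step, split into its three stages (hypothetical container)."""
    think: str
    think_summary: str
    action: str


def parse_step(text: str) -> StepOutput:
    """Extract the three tagged blocks from one model response.

    Raises ValueError if any block is missing, so a malformed step
    is not silently turned into an action.
    """
    fields = {}
    for tag in _TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        if match is None:
            raise ValueError(f"missing <{tag}> block")
        fields[tag] = match.group(1).strip()
    return StepOutput(**fields)


# Hypothetical response text; the wording and action label are made up.
example = (
    "<think>The chair on the left is red, but the target is a blue armchair.</think>\n"
    "<think_summary>Not the target; keep exploring.</think_summary>\n"
    "<action>waypoint_2</action>"
)
step = parse_step(example)
print(step.action)  # prints "waypoint_2"
```

Keeping the summary and action as separate fields means the reasoning trace can be logged for inspection while only the `action` string is passed on to the controller.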
