InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning

Ji Zhang; Shihan Wu; Xu Luo; Hao Wu; Lianli Gao; Heng Tao Shen; Jingkuan Song

TL;DR

This paper tackles spurious correlations in Vision-Language-Action models, which hamper generalization when mapping language and visual inputs to robot actions. It introduces Intrinsic Spatial Reasoning (InSpire), which adds a spatial-reasoning VQA step to derive a task-relevant representation u' and conditions action generation on (o,l,u'); it works as a plug-in and needs no extra data or external models. Experiments on LIBERO, CALVIN, and real-world tasks show consistent improvements, with particular gains from 1D directional spatial reasoning and strategic VQA insertion. The approach reduces shortcut learning, improves robustness to distractors, and enables continual action correction, making robotic manipulation more reliable.

Abstract

Leveraging pretrained Vision-Language Models (VLMs) to map language instructions and visual observations to raw low-level actions, Vision-Language-Action models (VLAs) hold great promise for achieving general-purpose robotic systems. Despite their advancements, existing VLAs tend to spuriously correlate task-irrelevant visual features with actions, limiting their generalization capacity beyond the training data. To tackle this challenge, we propose Intrinsic Spatial Reasoning (InSpire), a simple yet effective approach that mitigates the adverse effects of spurious correlations by boosting the spatial reasoning ability of VLAs. Specifically, InSpire redirects the VLA's attention to task-relevant factors by prepending the question "In which direction is the [object] relative to the robot?" to the language instruction and aligning both the answer ("right/left/up/down/front/back/grasped") and the predicted actions with the ground truth. Notably, InSpire can be used as a plugin to enhance existing autoregressive VLAs, requiring no extra training data or interaction with other large models. Extensive experimental results in both simulation and real-world environments demonstrate the effectiveness and flexibility of our approach.
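The prompt construction described in the abstract can be illustrated with a minimal sketch. The question template and the answer vocabulary are quoted from the abstract; everything else is an assumption made for illustration. In particular, the abstract does not say how ground-truth directions are derived, so `direction_label` uses a hypothetical dominant-axis rule in an assumed coordinate frame (+x front, +y left, +z up), and the function names are not from the paper.

```python
from typing import Sequence

# Answer vocabulary quoted in the abstract.
ANSWERS = ("right", "left", "up", "down", "front", "back", "grasped")


def build_prompt(instruction: str, target_object: str) -> str:
    """Prepend the spatial-reasoning question to the task instruction."""
    question = f"In which direction is the {target_object} relative to the robot?"
    return f"{question} {instruction}"


def direction_label(
    object_pos: Sequence[float],
    robot_pos: Sequence[float],
    grasped: bool = False,
) -> str:
    """Hypothetical labelling rule (an assumption, not the paper's procedure).

    Picks the axis with the largest absolute offset between object and robot.
    Assumed frame: +x = front, +y = left, +z = up.
    """
    if grasped:
        return "grasped"
    dx, dy, dz = (o - r for o, r in zip(object_pos, robot_pos))
    axes = [
        (abs(dx), "front" if dx >= 0 else "back"),
        (abs(dy), "left" if dy >= 0 else "right"),
        (abs(dz), "up" if dz >= 0 else "down"),
    ]
    # max() returns the first maximal entry, so ties favour x, then y, then z.
    return max(axes, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    print(build_prompt("pick up the mug", "mug"))
    print(direction_label((0.1, -0.4, 0.0), (0.0, 0.0, 0.0)))
```

During training, the answer token produced by such a labelling rule and the ground-truth actions would both serve as supervision targets; the loss itself is not shown here because the abstract gives no details about it.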
