InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning

Ji Zhang, Shihan Wu, Xu Luo, Hao Wu, Lianli Gao, Heng Tao Shen, Jingkuan Song

TL;DR

This paper tackles the problem of spurious correlations in Vision-Language-Action models that hamper generalization when mapping language and visual inputs to robot actions. It introduces Intrinsic Spatial Reasoning (InSpire), which injects a spatial reasoning VQA step to derive a task-relevant representation u' and conditions action generation on (o,l,u'), functioning as a plug-in without extra data or external models. Extensive experiments on LIBERO, CALVIN, and real-world tasks demonstrate consistent improvements, with particular gains from 1D directional spatial reasoning and strategic VQA insertion. The approach reduces shortcut learning, improves robustness to distractors, and enables continual action correction, highlighting its practical impact for more reliable robotic manipulation.

Abstract

Leveraging pretrained Vision-Language Models (VLMs) to map language instruction and visual observations to raw low-level actions, Vision-Language-Action models (VLAs) hold great promise for achieving general-purpose robotic systems. Despite their advancements, existing VLAs tend to spuriously correlate task-irrelevant visual features with actions, limiting their generalization capacity beyond the training data. To tackle this challenge, we propose Intrinsic Spatial Reasoning (InSpire), a simple yet effective approach that mitigates the adverse effects of spurious correlations by boosting the spatial reasoning ability of VLAs. Specifically, InSpire redirects the VLA's attention to task-relevant factors by prepending the question "In which direction is the [object] relative to the robot?" to the language instruction and aligning the answer "right/left/up/down/front/back/grasped" and predicted actions with ground-truth. Notably, InSpire can be used as a plugin to enhance existing autoregressive VLAs, requiring no extra training data or interaction with other large models. Extensive experimental results in both simulation and real-world environments demonstrate the effectiveness and flexibility of our approach.
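
The question-prepending step described in the abstract can be illustrated with a minimal Python sketch. The question template and the answer vocabulary are quoted from the abstract; the function name `build_inspire_prompt`, the single-space separator, and the plain-string representation are assumptions made for illustration, not the authors' implementation.

```python
# Answer vocabulary quoted from the abstract.
DIRECTION_ANSWERS = ("right", "left", "up", "down", "front", "back", "grasped")

# Question template quoted from the abstract; "[object]" is filled in per task.
QUESTION_TEMPLATE = "In which direction is the {obj} relative to the robot?"


def build_inspire_prompt(instruction: str, target_object: str) -> str:
    """Prepend the spatial-reasoning question to the language instruction.

    Hypothetical helper: the exact separator and prompt layout used by the
    paper are not given in this section, so a single space is assumed.
    """
    question = QUESTION_TEMPLATE.format(obj=target_object)
    return f"{question} {instruction}"


prompt = build_inspire_prompt("pick up the mug", "mug")
print(prompt)
```

Under this reading, the model would first produce one of the `DIRECTION_ANSWERS` tokens and then the action tokens, with both supervised against ground truth.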

Paper Structure

This paper contains 25 sections, 1 equation, 7 figures, 3 tables.

Figures (7)

  • Figure 1: (a) VLAs typically predict actions by relying on spurious correlations learned through the direct observation-to-action mapping mechanism. (b) The core idea of our proposed InSpire method, which tackles spurious correlations by boosting the spatial reasoning capabilities of VLAs.
  • Figure 2: Overview of our InSpire approach. InSpire boosts the VLA's spatial reasoning ability by prepending the question "In which direction is the [object] relative to the robot?" to the language instruction and aligning the VLA's answer "right/left/up/down/front/back/grasped" and predicted actions with the ground truth.
  • Figure 3: Automated rule-based object direction labeling. At each waypoint of a trajectory, the 3D locations of the robot's gripper and target objects are obtained from the simulation environment or recorded positions where the robot interacts with objects in the real-world environment. These locations are used in a rule-based strategy to automatically compute the object's direction.
  • Figure 4: CALVIN Performance.
  • Figure 5: Real-world Performance. (a)(b) Success rates (%) of the state-of-the-art model $\pi_0$-FAST [pertsch2025fast] with and without InSpire on seen and unseen real-world manipulation tasks. (c) Average time cost per step (in seconds) over all seen and unseen tasks. $\Delta$: absolute improvement.
  • ...and 2 more figures
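
The rule-based direction labeling described in the Figure 3 caption can be sketched as follows. The caption only states that the 3D locations of the gripper and the target object are fed to a rule-based strategy; the specific rule here (pick the axis with the largest absolute offset), the axis convention (+x front, +y left, +z up), and the names `label_direction` and `is_grasped` are all assumptions for illustration and may differ from the paper's actual rules.

```python
from typing import Sequence

# Assumed axis convention (not stated in the source): +x front, +y left, +z up.
AXIS_LABELS = {
    "x": ("front", "back"),
    "y": ("left", "right"),
    "z": ("up", "down"),
}


def label_direction(
    gripper_xyz: Sequence[float],
    object_xyz: Sequence[float],
    is_grasped: bool = False,
) -> str:
    """Return a coarse direction label for the object relative to the gripper.

    Hypothetical rule: choose the axis with the largest absolute offset and
    report its sign as one of the six direction words.
    """
    if is_grasped:
        return "grasped"
    # Offset of the object from the gripper along each axis.
    dx, dy, dz = (o - g for o, g in zip(object_xyz, gripper_xyz))
    offsets = {"x": dx, "y": dy, "z": dz}
    # Dominant axis: the one with the largest absolute displacement.
    axis = max(offsets, key=lambda k: abs(offsets[k]))
    positive_label, negative_label = AXIS_LABELS[axis]
    return positive_label if offsets[axis] >= 0 else negative_label


# One label per waypoint of a (made-up) trajectory.
waypoints = [((0.0, 0.0, 0.5), (0.1, 0.4, 0.5)), ((0.1, 0.4, 0.9), (0.1, 0.4, 0.5))]
labels = [label_direction(g, o) for g, o in waypoints]
print(labels)
```

Applying a function like this at every waypoint yields the per-step answer labels without manual annotation, which is consistent with the claim that no extra training data is required.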