Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration

Ninghao Zhang; Bin Zhu; Shijie Zhou; Jingjing Chen

TL;DR

This paper identifies linguistic blindness, a critical failure mode in which VLA policies continue executing visually plausible actions even when the language instruction contradicts the scene, and proposes Instruction-Guided Attention Recalibration (IGAR), a train-free inference-time mechanism that rebalances attention distributions to restore the influence of language instructions.

Abstract

Vision-Language-Action (VLA) models enable robots to perform manipulation tasks directly from natural language instructions and are increasingly viewed as a foundation for generalist robotic policies. However, their reliability under Out-of-Distribution (OOD) instructions remains underexplored. In this paper, we reveal a critical failure mode in which VLA policies continue executing visually plausible actions even when the language instruction contradicts the scene. We refer to this phenomenon as linguistic blindness, where VLA policies prioritize visual priors over instruction semantics during action generation. To systematically analyze this issue, we introduce ICBench, a diagnostic benchmark constructed from the LIBERO dataset that probes language-action coupling by injecting controlled OOD instruction contradictions while keeping the visual environment unchanged. Evaluations on three representative VLA architectures (Pi0, Pi0.5, and OpenVLA-OFT) show that these models frequently complete tasks even when the instruction is logically impossible, revealing a strong visual bias in action generation. To mitigate this issue, we propose Instruction-Guided Attention Recalibration (IGAR), a train-free inference-time mechanism that rebalances attention distributions to restore the influence of language instructions. IGAR operates without retraining or architectural modification and can be directly applied to existing VLA models. Experiments across 30 LIBERO tasks demonstrate that IGAR substantially reduces erroneous execution under OOD contradictory instructions while preserving baseline task performance. We additionally validate the approach on a real Franka robotic arm, where IGAR effectively prevents manipulation triggered by inconsistent instructions.

Paper Structure (20 sections, 9 equations, 5 figures, 3 tables)

Introduction
Related Work
Vision-Language-Action Models
Linguistic Grounding and Modality Bias
ICBench: A Controlled Instruction Contradiction Benchmark
Instruction Contradiction Construction
Contradiction Taxonomy and Design Principles
IGAR: Instruction-Guided Attention Recalibration
Problem Formulation
Instruction-Guided Attention Recalibration
Grounding Head Selection
Attention Redistribution
Experiments
Experimental Setup
Diagnosing Linguistic Blindness
...and 5 more sections

Figures (5)

Figure 1: Linguistic blindness in Vision-Language-Action (VLA) models. Under normal instructions (left), the robot completes the task correctly. Under contradictory instructions (right), a structured form of OOD linguistic input, the robot often follows the same visually plausible trajectory while ignoring the instruction.
Figure 2: Overview of the IGAR framework. IGAR is a train-free and plug-and-play intervention that restores linguistic grounding in VLA models via three stages: (1) detecting attention sink tokens through hidden-state spike analysis, (2) selecting grounding heads that exhibit cross-modal imbalance, and (3) redistributing attention from sink tokens to instruction tokens.
Figure 3: Attention visualization of OpenVLA-OFT with and without IGAR. We visualize cross-modal attention maps under normal and contradictory instructions. The baseline policy attends primarily to salient regions regardless of instruction semantics, while IGAR redistributes attention toward instruction-relevant objects and spatial regions, mitigating visual attention sinks and improving linguistic grounding.
Figure 4: Hyperparameter sensitivity. Effect of the text-sink decay factor ($p$), head selection bound ($\rho$), and number of intervened layers ($L$) on linguistic grounding performance (LGS). Results are reported using the OpenVLA-OFT architecture on the libero_goal benchmark suite. The dashed lines indicate the selected values.
Figure 5: Real-world experiments. We test the $\pi_0$ policy with and without IGAR on the task "placing the blue cube into the open drawer". Under normal instructions (left), both policies successfully place the blue cube into the open drawer. Under contradictory instructions (right), the $\pi_0$ policy still executes a visually plausible trajectory and produces a fake success. In contrast, IGAR restores linguistic grounding and prevents incorrect task execution, resulting in a deserved failure.
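
The three stages named in the Figure 2 caption (sink detection, grounding head selection, attention redistribution) can be illustrated with a small toy sketch. This is not the authors' implementation: the function names, the median-based spike test, the head selection rule, and the parameters `spike_ratio`, `rho`, and `decay` are all assumptions made for illustration, and their relation to the paper's $p$ and $\rho$ is only a guess. Attention rows are plain Python lists that sum to 1.

```python
from statistics import median


def detect_sink_tokens(hidden_norms, spike_ratio=5.0):
    """Stage 1 (assumed criterion): flag tokens whose hidden-state norm
    is more than `spike_ratio` times the median norm."""
    base = median(hidden_norms)
    return [i for i, n in enumerate(hidden_norms) if n > spike_ratio * base]


def select_heads(attn_rows, instr_idx, visual_idx, rho=0.5):
    """Stage 2 (assumed rule): keep heads whose attention mass on
    instruction tokens is below `rho` times their mass on visual tokens."""
    selected = []
    for h, row in enumerate(attn_rows):
        instr_mass = sum(row[i] for i in instr_idx)
        visual_mass = sum(row[i] for i in visual_idx)
        if instr_mass < rho * visual_mass:
            selected.append(h)
    return selected


def redistribute(row, sink_idx, instr_idx, decay=0.5):
    """Stage 3 (assumed form): move a `decay` fraction of the attention on
    sink tokens onto instruction tokens, in proportion to their current
    weights, so the row still sums to 1."""
    if not instr_idx:
        return list(row)  # nowhere to move the mass; leave the row unchanged
    new_row = list(row)
    moved = 0.0
    for i in sink_idx:
        taken = decay * new_row[i]
        new_row[i] -= taken
        moved += taken
    instr_mass = sum(row[i] for i in instr_idx)
    for i in instr_idx:
        # Fall back to a uniform split if instruction tokens had zero mass.
        share = row[i] / instr_mass if instr_mass > 0 else 1.0 / len(instr_idx)
        new_row[i] += moved * share
    return new_row


# Toy example: token 0 is a sink, tokens 0-1 are visual, tokens 2-3 are instruction.
heads = [[0.6, 0.2, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
sinks = detect_sink_tokens([50.0, 1.0, 1.1, 0.9])
chosen = select_heads(heads, instr_idx=[2, 3], visual_idx=[0, 1])
recalibrated = [redistribute(heads[h], sinks, [2, 3]) for h in chosen]
```

In this toy setting only the first head is modified, and half of its sink mass is shifted onto the two instruction tokens. A real intervention would operate on the model's attention tensors inside selected layers at inference time, which this sketch does not attempt.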
