Your Vision-Language-Action Model Already Has Attention Heads For Path Deviation Detection

Jaehwan Jeong, Evelyn Zhu, Jinying Lin, Emmanuel Jaimes, Tuan-Anh Vu, Jungseock Joo, Sangpil Kim, M. Khalid Jawed

Abstract

Vision-Language-Action (VLA) models have shown strong potential for predicting semantic actions in navigation tasks, reasoning over complex linguistic instructions and visual contexts. However, they are fundamentally hindered by visual-reasoning hallucinations that lead to trajectory deviations. Addressing this issue has conventionally required training external critic modules or relying on complex uncertainty heuristics. In this work, we discover that monitoring a few attention heads within a frozen VLA model can accurately detect path deviations without incurring additional computational overhead. We refer to these heads, which inherently capture the spatiotemporal causality between historical visual sequences and linguistic instructions, as Navigation Heads. Using these heads, we propose an intuitive, training-free anomaly-detection framework that monitors their signals to detect hallucinations in real time. Surprisingly, among over a thousand attention heads, a combination of just three is sufficient to achieve a 44.6% deviation detection rate with a low false-positive rate of 11.7%. Furthermore, upon detecting a deviation, we bypass the heavy VLA model and trigger a lightweight Reinforcement Learning (RL) policy to safely execute a shortest-path rollback. By integrating this entire detection-to-recovery pipeline onto a physical robot, we demonstrate its practical robustness. All source code will be publicly available.

Paper Structure (28 sections, 9 equations, 19 figures, 14 tables)

Figures (19)

  • Figure 1: Conceptual teaser of our approach to mitigating hallucinations in VLA-based navigation. 1. The Problem: VLA models are highly vulnerable to vision-language reasoning hallucinations, which cause severe deviations from planned trajectories. 2. The Discovery: We reveal that a specific subset of internal attention heads, termed Navigation Heads ($H_\texttt{nav}$), is highly sensitive to the robot's navigation states. This allows them to serve as a built-in anomaly detector with near-zero computational overhead. 3. Our Solution: Building on this insight, we enable real-time, training-free path deviation detection. Upon detecting a failure, the system immediately bypasses the computationally heavy VLA model and triggers a low-level Reinforcement Learning (RL) policy to execute a safe, collision-free rollback.
  • Figure 2: Overview of the path recovery navigation framework. Deployed via ROS 2, our hierarchical system integrates a high-level VLA model (0.3 Hz) and a low-level RL policy (10 Hz). Using RGB, instructions, and pose data, the VLA detects real-time path deviations ($\mathcal{N} \to \mathcal{A}$) and triggers a rollback to the last verified checkpoint $(x, y, \theta)$. Concurrently, the RL policy ($\pi$) utilizes LiDAR costmaps to output collision-free velocity commands $[v, \omega]$. This design enables the robot to dynamically avoid obstacles and safely return to the normal path via the quickest route during recovery.
  • Figure 3: Our core approach consists of four consecutive stages: (i) Phase Labeling: Formulating ground-truth labels based on path deviation. (ii) Head Selection: Identifying a subset of Navigation Heads $H_\texttt{nav}$ sensitive to these state transitions. (iii) Anomaly Detection: Monitoring the entropy ($\mathcal{R}_t$) of $H_\texttt{nav}$ to detect real-time failures and preserve safe checkpoints ($\mathcal{C}_\texttt{safe}$). (iv) Action Policy: Deploying an RL policy for collision-free navigation and rollbacks triggered by the detected anomalies.
  • Figure 4: Training-free spatiotemporal grounding framework for navigation. To select robust Navigation Heads, candidates are first evaluated using the alignment score $I_{\texttt{diag}}(h)$, which comprises four terms: even frame energy ($S_{\texttt{uniform}}$), focused instruction attention ($S_{\texttt{peak}}$), diagonal alignment ($S_{\texttt{diag}}$), and smooth transition ($S_{\texttt{shift}}$). They are subsequently assessed for their sensitivity to attention changes between normal and anomalous navigation.
  • Figure 5: The network processes $128\times128$ visual costmaps through a CNN encoder to extract spatial obstacle features, and relative local subgoal states $(x,y)$ through an MLP encoder for directional guidance. These features are fused and fed to two specialized heads: the Actor head, which outputs Gaussian distributions for velocity control commands $(v,\omega)$, and the Critic head, which estimates the state-value $V(s)$. This hierarchical design enables the agent to learn reactive obstacle avoidance and robust subgoal navigation through advantage-based reinforcement learning.
  • ...and 14 more figures
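
The anomaly-detection stage in the Figure 3 caption (monitoring the entropy of $H_\texttt{nav}$ and preserving a safe checkpoint) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `attention_entropy`, `monitor_signal`, and `DeviationMonitor`, the use of a mean over heads, the fixed threshold, and the assumption that higher entropy signals a deviation are all hypothetical choices made only for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

Pose = Tuple[float, float, float]  # (x, y, theta), as in the Figure 2 caption


def attention_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one attention row, renormalized to sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


def monitor_signal(head_rows: List[Sequence[float]]) -> float:
    """Mean entropy over the monitored heads (assumed aggregation, not from the paper)."""
    return sum(attention_entropy(row) for row in head_rows) / len(head_rows)


class DeviationMonitor:
    """Flags a deviation when the signal exceeds a threshold; otherwise stores the pose."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold  # hypothetical, would need tuning on labeled data
        self.safe_checkpoint: Optional[Pose] = None

    def step(self, head_rows: List[Sequence[float]], pose: Pose) -> Tuple[bool, Optional[Pose]]:
        """Return (anomalous, last safe checkpoint) for the current time step."""
        if monitor_signal(head_rows) > self.threshold:
            # Assumed direction: diffuse attention (high entropy) means the heads lost grounding.
            return True, self.safe_checkpoint
        self.safe_checkpoint = pose
        return False, self.safe_checkpoint


monitor = DeviationMonitor(threshold=1.0)
normal = monitor.step([[0.9, 0.05, 0.05]], (0.0, 0.0, 0.0))
deviated = monitor.step([[1.0, 1.0, 1.0, 1.0]], (1.0, 0.0, 0.0))
```

In this sketch the second call reports a deviation and returns the pose stored at the first call, which is the point a rollback policy would be asked to return to. A real system would read the attention rows from the frozen VLA model's selected heads rather than from hand-written lists.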