
Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation

Anupam Pani, Yanchao Yang

Abstract

Despite advances in Vision-Language-Action (VLA) models, robotic manipulation struggles with fine-grained tasks because current models lack mechanisms for active visual attention allocation. Human gaze naturally encodes intent, planning, and execution patterns -- offering a powerful supervisory signal for guiding robot perception. We introduce a gaze-regularized training framework that aligns VLA models' internal attention with human visual patterns without architectural modifications or inference-time overhead. Our method transforms temporally aggregated gaze heatmaps into patch-level distributions and regularizes the transformer's attention through KL divergence, creating an inductive bias toward task-relevant features while preserving deployment efficiency. When integrated into existing VLA architectures, our approach yields 4-12% improvements across manipulation benchmarks. The gaze-regularized models reach equivalent performance with fewer training steps and maintain robustness under lighting variations and sensor noise. Beyond performance metrics, the learned attention patterns produce interpretable visualizations that mirror human strategies, enhancing trust in robotic systems. Moreover, our framework requires no eye-tracking equipment and applies directly to existing datasets. These results demonstrate that human perceptual priors can significantly accelerate robot learning while improving both task performance and system interpretability.
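The regularizer described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the sum-pooling scheme, the KL direction (gaze as the target distribution), the smoothing constant `eps`, the loss weight `weight`, and all function names are assumptions made for illustration.

```python
import math


def heatmap_to_patch_distribution(heatmap, patch_size, eps=1e-8):
    """Sum-pool an H x W gaze heatmap into a flat distribution over patches.

    Assumption: patches are non-overlapping squares in row-major order,
    matching a ViT-style patch grid.
    """
    height, width = len(heatmap), len(heatmap[0])
    if height % patch_size or width % patch_size:
        raise ValueError("heatmap size must be a multiple of patch_size")
    masses = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            mass = sum(
                heatmap[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            )
            masses.append(mass + eps)  # eps keeps every patch strictly positive
    total = sum(masses)
    return [m / total for m in masses]


def kl_divergence(gaze, attention, eps=1e-8):
    """KL(gaze || attention) for two distributions over the same patches."""
    return sum(g * math.log((g + eps) / (a + eps)) for g, a in zip(gaze, attention))


def regularized_loss(action_loss, gaze, attention, weight=0.1):
    """Task loss plus a weighted gaze-alignment penalty (weight is hypothetical)."""
    return action_loss + weight * kl_divergence(gaze, attention)


# Toy 4 x 4 heatmap with most gaze mass in the top-left 2 x 2 patch.
heatmap = [
    [4.0, 4.0, 0.0, 0.0],
    [4.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
gaze = heatmap_to_patch_distribution(heatmap, patch_size=2)
uniform_attention = [0.25] * 4   # scattered attention over all patches
aligned_attention = list(gaze)   # attention that matches the gaze prior

print(kl_divergence(gaze, uniform_attention))  # positive penalty
print(kl_divergence(gaze, aligned_attention))  # no penalty when aligned
```

Because the penalty only enters the training loss, nothing in this sketch is needed at inference time, which is consistent with the abstract's claim of no inference-time overhead.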

Paper Structure (71 sections, 21 equations, 13 figures, 13 tables, 2 algorithms)


Figures (13)

  • Figure 1: Effect of Gaze Regularization. The baseline (middle) exhibits scattered attention across the scene, while the gaze-regularized model (right) concentrates on task-relevant regions (the plate and its immediate surroundings). This focused attention pattern not only improves task performance but also provides interpretable visual grounding that enhances trust in the model.
  • Figure 2: Overview of the Proposed Gaze-Regularized VLA Framework. Left: During training, gaze priors are converted into patch-level gaze distributions that match the transformer’s attention resolution. The KL divergence between gaze and model attention is minimized, guiding the model to align its visual focus with human fixation patterns over time. Right: During inference, the policy operates without any gaze input. Visual, language, and proprioceptive tokens are processed by the vision–language backbone and action head, and fused through causal attention to produce action features, which are mapped by the action decoder to control outputs. This training-time regularization yields gaze-aligned internal representations while maintaining a lightweight, gaze-free inference pipeline.
  • Figure 3: Temporally Aggregated Gaze Prior Generation. A sequence of $k$ video frames is tokenized and processed by the GLC module (lai2022eye), which predicts per-frame gaze heatmaps using both past and future context. These heatmaps are temporally aggregated to yield a gaze distribution that captures attention over time and serves as the supervision signal for training-time regularization.
  • Figure 4: Closer Look at Gaze Prior Generation. A sequence of $k$ video frames is tokenized and processed by the GLC module (lai2022eye), which combines global tokens (derived from the sequence) with local tokens and applies self-attention and Global-Local Correlation to predict per-frame gaze heatmaps. These heatmaps are temporally aggregated to yield a gaze distribution that captures attention over time and serves as the supervision signal for training-time regularization.
  • Figure 5: Additional Visualisations of Attention. Given the input observation, we show the spatial attention from the baseline model (second), the attention obtained when a perturbed gaze variant is used (third, corresponding to Table \ref{tab:libero_spatial_30k_full}), and finally the sharper, task-relevant attention produced by our gaze-regularized model (fourth).
  • ...and 8 more figures
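
The temporal aggregation step mentioned in the captions of Figures 3 and 4 can be sketched as follows. The captions do not state how the per-frame heatmaps are combined, so the weighted average used here (uniform by default) is an assumption, as are the function name and the `weights` parameter. The GLC module is not reproduced; its per-frame heatmaps are taken as given inputs.

```python
def aggregate_gaze_heatmaps(heatmaps, weights=None):
    """Combine k per-frame gaze heatmaps (each H x W) into one normalized map.

    Assumption: aggregation is a weighted average over frames; the source
    only says the heatmaps are "temporally aggregated".
    """
    if not heatmaps:
        raise ValueError("need at least one heatmap")
    k = len(heatmaps)
    if weights is None:
        weights = [1.0 / k] * k  # uniform weighting over the k frames
    if len(weights) != k:
        raise ValueError("need exactly one weight per frame")

    height, width = len(heatmaps[0]), len(heatmaps[0][0])
    aggregated = [[0.0] * width for _ in range(height)]
    for weight, frame in zip(weights, heatmaps):
        for r in range(height):
            for c in range(width):
                aggregated[r][c] += weight * frame[r][c]

    # Renormalize so the result sums to 1 and can serve as a distribution.
    total = sum(sum(row) for row in aggregated)
    if total <= 0:
        raise ValueError("aggregated heatmap has no mass")
    return [[value / total for value in row] for row in aggregated]


# Two toy 2 x 2 frames: gaze moves from the top-left to the bottom-right.
frame_a = [[1.0, 0.0], [0.0, 0.0]]
frame_b = [[0.0, 0.0], [0.0, 1.0]]

uniform = aggregate_gaze_heatmaps([frame_a, frame_b])
early_biased = aggregate_gaze_heatmaps([frame_a, frame_b], weights=[0.75, 0.25])
print(uniform)
print(early_biased)
```

The aggregated map keeps mass at every location fixated within the window, so a prior built this way reflects where attention travels over time rather than a single-frame fixation.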