FocusVLA: Focused Visual Utilization for Vision-Language-Action Models

Yichi Zhang, Weihao Yuan, Yizhuo Zhang, Xidong Zhang, Jia Wan

Abstract

Vision-Language-Action (VLA) models improve action generation by conditioning policies on rich vision-language information. However, current auto-regressive policies are constrained by three bottlenecks: (1) architectural bias drives models to overlook visual details, (2) an excessive number of visual tokens makes it difficult for attention to focus on the correct regions, and (3) task-irrelevant visual information introduces substantial noise. Together, these issues severely impair action quality. In this paper, we investigate how to effectively utilize different visual representations for action generation. To this end, we first empirically validate the above issues and show that VLA performance is primarily limited by how visual information is utilized, rather than by the quality of visual representations. Based on these insights, we introduce FocusVLA, a novel paradigm that directs the model's attention to task-relevant visual regions to effectively bridge vision to action. Specifically, we first propose Modality Cascaded Attention to eliminate shortcut pathways, thereby compelling VLA models to rely on task-relevant visual details for action generation. Furthermore, we propose Focus Attention, which dynamically selects task-relevant visual patches to control information quantity while explicitly modulating their influence to suppress task-irrelevant noise. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that FocusVLA not only effectively leverages visual details to perform dexterous manipulation, but also substantially improves performance and accelerates convergence across a variety of tasks.
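
To make this concrete, here is a minimal PyTorch sketch of the Focus Attention idea. The abstract only states that task-relevant visual patches are dynamically selected and their influence explicitly modulated; the dot-product relevance scoring, the fixed keep ratio, and the sigmoid channel gate below are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn


class FocusAttention(nn.Module):
    """Sketch of Focus Attention: patch-level pruning plus channel-level
    gated suppression, as described in the abstract. The scoring and
    gating forms and all hyperparameters are assumptions."""

    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.keep_ratio = keep_ratio          # fraction of visual tokens kept (assumed)
        self.q_proj = nn.Linear(dim, dim)     # projects the action latent into a query
        self.k_proj = nn.Linear(dim, dim)     # projects visual tokens into keys
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())  # per-channel gate

    def forward(self, action_latent: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # action_latent: [B, D]; visual: [B, N, D]
        q = self.q_proj(action_latent).unsqueeze(1)               # [B, 1, D]
        k = self.k_proj(visual)                                   # [B, N, D]
        scores = (q @ k.transpose(1, 2)).squeeze(1)               # [B, N] relevance
        scores = scores / (k.shape[-1] ** 0.5)

        # Patch-level pruning: keep only the top-k most task-relevant tokens.
        k_keep = max(1, int(self.keep_ratio * visual.shape[1]))
        idx = scores.topk(k_keep, dim=-1).indices                 # [B, k_keep]
        idx = idx.unsqueeze(-1).expand(-1, -1, visual.shape[-1])  # [B, k_keep, D]
        focused = visual.gather(1, idx)                           # [B, k_keep, D]

        # Channel-level suppression: rescale each surviving token's channels
        # so residual task-irrelevant signal is attenuated rather than passed on.
        return focused * self.gate(focused)
```

In this reading, pruning controls the quantity of visual information the policy attends over, while the gate controls its quality; the two operations map onto bottlenecks (2) and (3) above.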

Figures (18)

  • Figure 1: Illustration of structural limitations in existing auto-regressive Vision-Language-Action (VLA) models and the motivation for FocusVLA. OpenVLA-OFT does not directly utilize visual tokens, limiting its ability to ground actions in fine-grained visual details. VLA-Adapter's mixed attention introduces architectural shortcuts that let the model bypass concrete visual details by favoring the easier action-query pathway, leading to imprecise manipulation; it also introduces a near-zero gating factor that substantially suppresses visual signals. In contrast, FocusVLA uses Modality Cascaded Attention to enforce sequential modality interaction, ensuring the model relies on task-relevant visual details before action reasoning and eliminating shortcut pathways. In addition, we propose Focus Attention to suppress task-irrelevant information, leading to precise and robust manipulation.
  • Figure 2: Visualization of attention maps on LIBERO simulation (left) and real-world environments (right). We extract the attention scores from the last layer, average them across attention heads and queries, and project them back onto the image plane for visualization. Specifically, for VLM, we use the attention from the action query to visual tokens, while for the policy, we use the attention from the action latent to visual tokens. VLA-Adapter exhibits highly scattered and distracted attention patterns, largely focusing on task-irrelevant regions due to structural shortcuts, thereby failing to capture concrete visual details. In contrast, FocusVLA produces concentrated and task-aligned attention maps that consistently focus on contact regions and manipulation targets. This targeted attention enables more precise action generation across both simulated and real-world settings.
  • Figure 3: Architectures of the four proposed policy variants. (a) Vanilla: A baseline utilizing visual tokens without constraints. (b) Pooling: Patch-level optimization by reducing visual token count via 2×2 pooling. (c) 1-param gate: Channel-level optimization that attenuates visual signal intensity using a single-parameter gate. (d) Cascaded attention: Structure-level optimization that alters feature interaction patterns through cascaded multi-head attention; a sketch of this variant is given after the figure list.
  • Figure 4: Success rates of different policy architectures on LIBERO-Long. We evaluate four different structures across three distinct visual representations, as detailed in the analysis section. The experimental results reveal that existing policies often suffer from three critical biases: token quantity imbalance, low signal-to-noise ratio, and structural bias. By systematically addressing these issues through our proposed constraints, we achieve consistent and significant performance improvements across all visual representations. These findings demonstrate that VLA performance is primarily limited by how visual information is utilized, rather than by the quality of visual representations. “VLM” denotes the output visual features from the PrismaticVLM trained on Qwen2.5-0.5B [qwen2], while “DS” denotes the combined features of DINOv2 and SigLIP [dinov2, siglip]. The red dashed line indicates the performance of a variant that does not use VLM visual features, relying solely on action queries.
  • Figure 5: The architecture of FocusVLA. Our policy explicitly addresses visual utilization bottlenecks through two core components: (1) Cascaded Attention: By enabling the action latent to query each modality independently, the model is forced to focus on and extract only the most task-relevant visual details, ensuring a more targeted feature utilization; and (2) Focus Attention: To resolve the quantity and quality limitations, this component employs patch-level pruning to discard irrelevant tokens and channel-level gated suppression to mitigate feature noise.
  • ...and 13 more figures
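
Figures 1, 3(d), and 5 describe Modality Cascaded Attention only at a block-diagram level. The sketch below shows one plausible reading in PyTorch, assuming standard pre-norm cross-attention blocks and a vision-then-language ordering; neither of these details is confirmed by the captions.

```python
import torch
import torch.nn as nn


class ModalityCascadedAttention(nn.Module):
    """Sketch of cascaded modality interaction: the action latent
    cross-attends to each modality in sequence rather than attending
    over a mixed token set, so it cannot route around the visual
    evidence. Block layout and modality ordering are assumptions."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.vis_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lang_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, action: torch.Tensor, visual: torch.Tensor,
                language: torch.Tensor) -> torch.Tensor:
        # action: [B, A, D]; visual: [B, N, D]; language: [B, L, D]
        # Stage 1: ground the action latent in visual tokens only.
        x, _ = self.vis_attn(self.norm1(action), visual, visual)
        action = action + x
        # Stage 2: condition the visually grounded latent on language tokens.
        x, _ = self.lang_attn(self.norm2(action), language, language)
        return action + x
```

Because the action latent must pass through a vision-only stage before it sees any other modality, there is no mixed-attention pathway through which it could skip the visual details, which is exactly the shortcut the captions attribute to VLA-Adapter's mixed attention.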