Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models

Yulin Luo, Hao Chen, Zhuangzhe Wu, Bowen Sui, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Qiuxuan Feng, Jiale Yu, Shuo Gu, Peng Jia, Pheng-Ann Heng, Shanghang Zhang

Abstract

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, in which reliable action prediction critically depends on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent works have sought to enhance the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box, providing limited insight into how visual information is grounded into action generation. Therefore, we perform a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens progressively decreases in deeper layers during action generation. Motivated by this observation, we propose DeepVision-VLA, built on a Vision-Language Mixture-of-Transformers (VL-MoT) framework. This framework enables shared attention between the vision foundation model and the VLA backbone, injecting multi-level visual features from the vision expert into deeper layers of the VLA backbone to enhance visual representations for precise and complex manipulation. In addition, we introduce Action-Guided Visual Pruning (AGVP), which leverages shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, reinforcing critical visual cues for manipulation with minimal computational overhead. DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively, providing new insights for the design of visually enhanced VLA models.

Paper Structure (16 sections, 4 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Layer-wise analysis of visual grounding in VLA models. Top: Effect of masking ROI visual tokens on action prediction error (MSE) at different layers. Masking in shallow layers significantly degrades performance, while the impact diminishes in deeper layers. Bottom: Action-to-vision attention maps across layers. Attention is concentrated on task-relevant regions in shallow layers but becomes increasingly diffuse in deeper layers, indicating weakened visual grounding.
  • Figure 2: Framework. (a) A high-resolution Vision Expert is coupled with the LLM backbone through the proposed Vision-Language Mixture-of-Transformers (VL-MoT) framework, where deep LLM layers share attention with the Vision Expert to enhance visual grounding for action prediction. (b) Action-to-vision attention from shallow LLM layers is aggregated to identify task-relevant regions, which are used to prune Vision Expert tokens before fusion. (c) Vision Expert tokens use bidirectional attention to preserve their pretrained knowledge. VLA tokens apply causal attention to prompts and bidirectional attention to action tokens for parallel prediction.
  • Figure 3: Ablation studies. (a) Comparison of different paradigms for integrating Vision Expert features into the VLA backbone. (b) Evaluation of different multi-level feature selection strategies from the Vision Expert. (c) Comparison of different guidance mechanisms for generating the visual pruning map in AGVP. (d) Analysis of different shallow-layer references used to compute the action-to-vision attention map for pruning.
  • Figure 4: Visualization of real-world task execution by single-arm robots (each sequence proceeds from left to right).
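
The pruning step described in the Figure 2(b) caption (aggregate shallow-layer action-to-vision attention, then keep only the task-relevant Vision Expert tokens) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the use of a plain average across layers, the top-k selection rule, and the keep_ratio parameter are all assumptions made for illustration, and the sketch assumes the attention scores are already aligned one-to-one with the Vision Expert tokens.

```python
from typing import List, Sequence


def aggregate_attention(layer_maps: Sequence[Sequence[float]]) -> List[float]:
    """Average action-to-vision attention over several shallow layers.

    layer_maps[l][i] is a hypothetical attention weight that the action
    tokens place on visual token i at shallow layer l, assumed to be
    already averaged over heads and action tokens.
    """
    if not layer_maps:
        raise ValueError("need at least one layer map")
    num_tokens = len(layer_maps[0])
    if any(len(m) != num_tokens for m in layer_maps):
        raise ValueError("all layer maps must cover the same visual tokens")
    num_layers = len(layer_maps)
    return [sum(m[i] for m in layer_maps) / num_layers for i in range(num_tokens)]


def select_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the highest-scoring tokens, in their original order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Assumption: a fixed fraction of tokens is kept; always keep at least one.
    num_keep = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore spatial order so downstream position information stays consistent.
    return sorted(ranked[:num_keep])


# Toy example: two shallow layers, six visual tokens (made-up values).
shallow_maps = [
    [0.05, 0.40, 0.30, 0.05, 0.15, 0.05],
    [0.05, 0.30, 0.40, 0.05, 0.15, 0.05],
]
scores = aggregate_attention(shallow_maps)
kept = select_tokens(scores, keep_ratio=0.5)
print(kept)
```

In this toy input, tokens 1, 2, and 4 receive the most aggregated attention, so they are the ones that would be passed on for fusion. Which shallow layers to use and how many tokens to keep are the kinds of choices that ablations (c) and (d) in Figure 3 examine.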