VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, Siteng Huang, Yifan Tang, Wenhui Wang, Ru Zhang, Jianyi Liu, Donglin Wang

TL;DR

VLA-Adapter introduces a lightweight bridging paradigm, built around Bridge Attention, that maps vision-language representations to robot actions and sharply reduces the need for large-scale VLM pre-training. By systematically analyzing VL conditions and exploiting multi-layer Raw and ActionQuery features, it reaches state-of-the-art-level performance with a 0.5B backbone and fast inference, and it can be trained on a consumer GPU in hours. The approach shows strong results on simulated benchmarks (LIBERO, CALVIN ABC→D) and on real-world tasks, including long-horizon manipulation, with notable generalization and efficiency gains. These findings suggest a practical path toward deployable VLA systems that reduce data, compute, and tuning costs while preserving high task performance.

Abstract

Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by pre-training a large-scale Vision-Language Model (VLM) on robotic data. While this approach greatly enhances performance, it also incurs significant training costs. In this paper, we investigate how to effectively bridge vision-language (VL) representations to action (A). We introduce VLA-Adapter, a novel paradigm designed to reduce the reliance of VLA models on large-scale VLMs and extensive pre-training. To this end, we first systematically analyze the effectiveness of various VL conditions and present key findings on which conditions are essential for bridging perception and action spaces. Based on these insights, we propose a lightweight Policy module with Bridge Attention, which autonomously injects the optimal condition into the action space. In this way, our method achieves high performance using only a 0.5B-parameter backbone, without any robotic data pre-training. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that VLA-Adapter not only achieves state-of-the-art-level performance, but also offers the fastest inference speed reported to date. Furthermore, thanks to the proposed bridging paradigm, VLA-Adapter enables the training of a powerful VLA model in just 8 hours on a single consumer-grade GPU, greatly lowering the barrier to deploying VLA models. Project page: https://vla-adapter.github.io/.

Paper Structure

This paper contains 57 sections, 4 equations, 14 figures, 15 tables.

Figures (14)

  • Figure 1: Characteristics of VLA-Adapter. "$\downarrow$" indicates that smaller values are better, and vice versa. Our paradigm can effectively obtain a SOTA-level VLA model using a tiny-scale backbone.
  • Figure 2: Existing representative bridge paradigms from VL to A.
  • Figure 3: The proposed VLA framework. The key components are the effective condition exploration and the Attention design. "Attention" specifically includes cross attention with the conditions and self attention over the action latent itself. In the "Unified VLA-Adapter Framework", "Attention" is the Bridge Attention described in the Bridge Attention subsection. The four conditions, which differ in "layer" and "type", are shown on the right.
  • Figure 4: Comparison of the four conditions in the VLA-Adapter framework on LIBERO-Long. The blue and green lines are single-layer $\mathcal{C}_t^{\mathcal{R}}$ and single-layer $\mathcal{C}_t^{\mathcal{AQ}}$, as in Figure 3a) and 3b). The blue and green columns are all-layer $\mathcal{C}_t^{\mathcal{R}}$ and all-layer $\mathcal{C}_t^{\mathcal{AQ}}$, as in Figure 3c) and 3d). The detailed results are shown in Appendix C. Please note: the number of ActionQuery is 64 here. This number is variable, similar to MetaQueries (Metaquery-2025) in MLLM research; we explore it in Section 4.5.
  • Figure 5: The Policy with Bridge Attention. The Policy has only 97M parameters when the backbone is Qwen2.5-0.5B. At each layer, $\mathcal{C}_t^{\mathcal{R}}$ and $\mathcal{C}_t^{\mathcal{AQ}}$ are integrated in Bridge Attention with the action latent of the corresponding layer. Bridge Attention maps VL to Action to the greatest extent. The degree of $\mathcal{C}_t^{\mathcal{R}}$ injection is learnable, which helps ensure performance and training stability.
  • ...and 9 more figures
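
The captions for Figures 3 and 5 describe Bridge Attention as cross attention with the two conditions plus self attention over the action latent, with a learnable degree of Raw-feature injection. The following is a minimal NumPy sketch of one such layer, not the authors' implementation. Everything not stated in the captions is an assumption made for illustration: the function names, the single-head attention without learned projections, the tanh-squashed scalar gate, the additive combination of the three branches, and all tensor sizes other than the 64 ActionQuery tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, context):
    # Single-head scaled dot-product attention.
    # Assumption: keys and values are both `context`, with no learned projections.
    scores = query @ context.T / np.sqrt(query.shape[-1])
    return softmax(scores) @ context

def bridge_attention(action, raw, action_query, gate):
    """Hypothetical Bridge Attention step for one Policy layer.

    action:       (H, d) action latent of this layer
    raw:          (N_r, d) Raw condition C_t^R from the matching backbone layer
    action_query: (N_q, d) ActionQuery condition C_t^AQ from the same layer
    gate:         scalar standing in for the learnable injection degree of C_t^R
    """
    from_raw = np.tanh(gate) * attend(action, raw)  # gated Raw injection (assumed form)
    from_aq = attend(action, action_query)          # cross attention with ActionQuery
    from_self = attend(action, action)              # self attention over the action latent
    return action + from_raw + from_aq + from_self  # assumed residual sum of the branches

rng = np.random.default_rng(0)
action = rng.normal(size=(8, 16))         # 8 action tokens, width 16 (illustrative sizes)
raw = rng.normal(size=(32, 16))           # 32 Raw tokens (illustrative size)
action_query = rng.normal(size=(64, 16))  # 64 ActionQuery tokens, as in the Figure 4 caption
out = bridge_attention(action, raw, action_query, gate=0.0)
```

In this sketch a gate of zero switches off the Raw branch entirely, so the output depends only on the ActionQuery condition and the action latent; moving the gate away from zero lets the Raw condition contribute gradually, which is one plausible reading of "learnable injection degree".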