A Survey on Efficient Vision-Language-Action Models

Zhaoshu Yu, Bo Wang, Pengpeng Zeng, Haonan Zhang, Ji Zhang, Lianli Gao, Jingkuan Song, Nicu Sebe, Heng Tao Shen

TL;DR

This survey addresses the significant efficiency bottlenecks of Vision-Language-Action (VLA) models, which integrate perception, language, and motor control for embodied AI. It proposes a unified taxonomy that partitions efforts into Efficient Model Design, Efficient Training, and Efficient Data Collection to address the end-to-end data-model-training lifecycle. The authors synthesize state-of-the-art techniques, identify representative applications, and chart a roadmap highlighting challenges and future directions toward adaptive, co-designed, edge-friendly VLAs. By consolidating disparate research into a cohesive efficiency-focused framework, the paper aims to accelerate the deployment of scalable, resource-conscious embodied intelligence across domains. Overall, the work provides a foundational reference to guide future research and practical development of Efficient VLAs.

Abstract

Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction. While these models have demonstrated remarkable generalist capabilities, their deployment is severely hampered by the substantial computational and data requirements inherent to their underlying large-scale foundation models. Motivated by the urgent need to address these challenges, this survey presents the first comprehensive review of Efficient Vision-Language-Action models (Efficient VLAs) across the entire data-model-training process. Specifically, we introduce a unified taxonomy to systematically organize the disparate efforts in this domain, categorizing current techniques into three core pillars: (1) Efficient Model Design, focusing on efficient architectures and model compression; (2) Efficient Training, which reduces computational burdens during model learning; and (3) Efficient Data Collection, which addresses the bottlenecks in acquiring and utilizing robotic data. Through a critical review of state-of-the-art methods within this framework, this survey not only establishes a foundational reference for the community but also summarizes representative applications, delineates key challenges, and charts a roadmap for future research. We maintain a continuously updated project page to track our latest developments: https://evla-survey.github.io/

Paper Structure

This paper contains 48 sections, 3 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Necessity of Efficient VLAs. This figure highlights the disparity between powerful but resource-intensive foundation VLAs and the practical deployment requirements of diverse edge robotic platforms. Bridging this gap by developing more compact, economical, and applicable solutions is the primary motivation for pursuing efficient VLAs.
  • Figure 2: The Organization of Our Survey. We systematically categorize efficient VLAs into three core pillars: (1) Efficient Model Design, encompassing efficient architectures and model compression techniques; (2) Efficient Training, covering efficient pre-training and post-training strategies; and (3) Efficient Data Collection, including efficient data collection and augmentation methods. The framework also reviews VLA foundations, key applications, challenges, and future directions, establishing the groundwork for advancing scalable embodied intelligence.
  • Figure 3: An Overview Of VLAs. VLAs integrate vision encoders to extract visual features, LLM backbones to fuse multimodal inputs, and action decoders (MLP-based, autoregressive, or generative) to produce robotic control signals, enabling end-to-end vision-language-action reasoning for embodied manipulation tasks.
  • Figure 4: Timeline of Foundational VLA Models and Efficient VLAs. The timeline illustrates the progression of foundational VLA models and efficient VLAs from 2023 to 2025, highlighting the explosive growth of research on improving VLA efficiency so that computational demands can be reconciled with real-world robotic deployment.
  • Figure 5: Key strategies for Efficient Architectures in VLAs. We illustrate six primary approaches: (a) Efficient Attention, mitigating the $O(n^2)$ complexity of standard self-attention; (b) Transformer Alternatives, such as Mamba; (c) Efficient Action Decoding, advancing from autoregressive generation to parallel and generative methods; (d) Lightweight Components, adopting smaller model backbones; (e) Mixture-of-Experts, employing sparse activation via input routing; and (f) Hierarchical Systems, which decouple high-level VLM planning from low-level VLA execution.
  • ...and 3 more figures
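
To make the $O(n^2)$ cost mentioned in the Figure 5(a) caption concrete, the following minimal sketch counts the query-key pairs that self-attention must score for a sequence of n tokens. The sliding-window variant is only one illustrative way to reduce that count and is an assumption here, not a method attributed to the survey; the function names and the `half_width` parameter are likewise hypothetical.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Pairs scored by standard self-attention: every token attends to every token."""
    return n_tokens * n_tokens


def windowed_attention_pairs(n_tokens: int, half_width: int) -> int:
    """Pairs scored when token i attends only to tokens j with |i - j| <= half_width.

    Illustrative sliding-window scheme (an assumption, not taken from the survey).
    """
    total = 0
    for i in range(n_tokens):
        lo = max(0, i - half_width)             # first key visible to query i
        hi = min(n_tokens - 1, i + half_width)  # last key visible to query i
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    # Doubling the sequence length quadruples the full-attention count,
    # while the windowed count grows roughly linearly.
    for n in (256, 512, 1024):
        print(n, full_attention_pairs(n), windowed_attention_pairs(n, half_width=8))
```

Under these assumptions, the full count grows with the square of the token count, whereas the windowed count is bounded by n times the window size. This gap is the kind of saving that efficient-attention designs aim for when a VLA processes many visual tokens per frame.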