A Survey on Efficient Vision-Language-Action Models

Zhaoshu Yu; Bo Wang; Pengpeng Zeng; Haonan Zhang; Ji Zhang; Lianli Gao; Jingkuan Song; Nicu Sebe; Heng Tao Shen

TL;DR

This survey addresses the efficiency bottlenecks of Vision-Language-Action (VLA) models, which integrate perception, language, and motor control for embodied AI. It proposes a unified taxonomy that organizes existing work into Efficient Model Design, Efficient Training, and Efficient Data Collection, covering the end-to-end data-model-training lifecycle. The authors synthesize state-of-the-art techniques, summarize representative applications, and chart a roadmap of challenges and future directions toward adaptive, co-designed, edge-friendly VLAs. By consolidating disparate research into a single efficiency-focused framework, the paper aims to accelerate the deployment of scalable, resource-conscious embodied intelligence across domains and to serve as a foundational reference for future research and practical development of Efficient VLAs.

Abstract

Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction. While these models have demonstrated remarkable generalist capabilities, their deployment is severely hampered by the substantial computational and data requirements inherent to their underlying large-scale foundation models. Motivated by the urgent need to address these challenges, this survey presents the first comprehensive review of Efficient Vision-Language-Action models (Efficient VLAs) across the entire data-model-training process. Specifically, we introduce a unified taxonomy to systematically organize the disparate efforts in this domain, categorizing current techniques into three core pillars: (1) Efficient Model Design, focusing on efficient architectures and model compression; (2) Efficient Training, which reduces computational burdens during model learning; and (3) Efficient Data Collection, which addresses the bottlenecks in acquiring and utilizing robotic data. Through a critical review of state-of-the-art methods within this framework, this survey not only establishes a foundational reference for the community but also summarizes representative applications, delineates key challenges, and charts a roadmap for future research. We maintain a continuously updated project page to track our latest developments: https://evla-survey.github.io/
