Vision Language Action Models in Robotic Manipulation: A Systematic Review

Muhayy Ud Din, Waseem Akram, Lyes Saad Saoud, Jan Rosell, Irfan Hussain

TL;DR

Vision Language Action models aim to unify perception, language understanding, and embodied control for robotic manipulation. The paper surveys 102 VLA models, 26 datasets, and 12 simulation platforms, proposing a two-dimensional dataset characterization and a modular architectural taxonomy to guide future work. Key contributions include a novel VLA dataset benchmarking framework, an architectural panorama of backbones and decoders, and a synthesis of simulation tools and evaluation practices. The findings highlight rapid progress but also underline challenges in scalable pretraining, sim-to-real transfer, and interpretable grounding, outlining a roadmap toward robust, generalist embodied agents.

Abstract

Vision Language Action (VLA) models represent a transformative shift in robotics, with the aim of unifying visual perception, natural language understanding, and embodied control within a single learning framework. This review presents a comprehensive and forward-looking synthesis of the VLA paradigm, with a particular emphasis on robotic manipulation and instruction-driven autonomy. We comprehensively analyze 102 VLA models, 26 foundational datasets, and 12 simulation platforms that collectively shape the development and evaluation of VLA models. These models are categorized into key architectural paradigms, each reflecting distinct strategies for integrating vision, language, and control in robotic systems. Foundational datasets are evaluated using a novel criterion based on task complexity, variety of modalities, and dataset scale, allowing a comparative analysis of their suitability for generalist policy learning. We introduce a two-dimensional characterization framework that organizes these datasets based on semantic richness and multimodal alignment, showing underexplored regions in the current data landscape. Simulation environments are evaluated for their effectiveness in generating large-scale data, as well as their ability to facilitate transfer from simulation to real-world settings and the variety of supported tasks. Using both academic and industrial contributions, we recognize ongoing challenges and outline strategic directions such as scalable pretraining protocols, modular architectural design, and robust multimodal alignment strategies. This review serves as both a technical reference and a conceptual roadmap for advancing embodiment and robotic control, providing insights that span from dataset generation to real-world deployment of generalist robotic agents.

Paper Structure

This paper contains 42 sections, 4 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: VLA models, datasets, and contributing institutions from 2022 to 2025. The top row presents major VLA models introduced each year, alongside their associated institutions (logos within red boxes). The bottom row displays key datasets used to train and evaluate these models, grouped by release year. The figure highlights the increasing scale and diversity of datasets and institutional involvement, with contributions from academic (e.g., CMU, CNRS, UC, Peking Uni) and industrial labs (e.g., Google, NVIDIA, Microsoft). This timeline highlights the rapid advancements in VLA research.
  • Figure 2: Overview of the skeleton of the paper, highlighting the main sections and their interrelated subtopics.
  • Figure 3: Annual counts of VLA models and foundational VLA datasets from 2022 to 2025. Green bars indicate the number of new VLA models introduced each year, while purple bars represent the number of novel dataset releases. The data illustrate a rapid acceleration in model development, particularly in 2025, alongside steady growth in dataset creation to support training and evaluation of these models.
  • Figure 4: An overview of the Transformer architecture highlighting the encoder-decoder structure and the internal mechanism of multi-head attention. The encoder processes input embeddings through layers of multi-head attention, normalization, and feedforward networks. The decoder mirrors this with additional masked attention layers and incorporates encoder outputs for contextual decoding. The magnified view illustrates the scaled dot-product attention and how multiple attention heads are concatenated and linearly transformed to form the final multi-head attention output. The image is adapted from vaswani2017attention.
  • Figure 5: Architecture of the ViT. The input image is divided into fixed-size non-overlapping patches which are flattened and linearly projected into embedding vectors. A learnable classification (CLS) token is prepended to the sequence of patch embeddings (shown in darker blue). Positional embeddings are added to retain spatial information before feeding the sequence into a standard Transformer encoder. The output of the CLS token is passed through an MLP head to produce the final class prediction. The image is adapted from dosovitskiy2021an.
  • ...and 5 more figures
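
The mechanisms described in the captions of Figures 4 and 5 can be made concrete with a small sketch. The following pure-Python example is illustrative only and is not taken from the paper: it splits a toy image into non-overlapping patches, projects them, prepends a CLS token, adds positional embeddings, and then applies multi-head scaled dot-product attention. All function names (`patchify`, `vit_tokens`, `multi_head_attention`, and so on) are hypothetical, random matrices stand in for learned parameters, and layer normalization, residual connections, and the feedforward sublayer are omitted for brevity.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    """Random weights; an assumption standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, n_heads, rng):
    """Run n_heads attention heads, concatenate them, and project to d_model."""
    d_model = len(x[0])
    d_head = d_model // n_heads
    head_outputs = []
    for _ in range(n_heads):
        w_q, w_k, w_v = (random_matrix(d_model, d_head, rng) for _ in range(3))
        out, _ = attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        head_outputs.append(out)
    # Concatenate the heads along the feature axis, token by token.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    w_o = random_matrix(d_model, d_model, rng)
    return matmul(concat, w_o)

def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened non-overlapping patches."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def vit_tokens(image, patch, d_model, rng):
    """Project patches to d_model, prepend a CLS token, add positional embeddings."""
    patches = patchify(image, patch)
    w_embed = random_matrix(patch * patch, d_model, rng)
    tokens = matmul(patches, w_embed)
    cls = random_matrix(1, d_model, rng)            # stands in for the learnable CLS token
    tokens = cls + tokens
    pos = random_matrix(len(tokens), d_model, rng)  # stands in for learned positions
    return [[t + p for t, p in zip(tok, ps)] for tok, ps in zip(tokens, pos)]

rng = random.Random(0)
image = [[(r * 4 + c) / 16.0 for c in range(4)] for r in range(4)]  # toy 4x4 grayscale image
tokens = vit_tokens(image, patch=2, d_model=8, rng=rng)
encoded = multi_head_attention(tokens, n_heads=2, rng=rng)
print(len(tokens), len(tokens[0]))    # 5 8: four patches plus one CLS token, width 8
print(len(encoded), len(encoded[0]))  # 5 8: attention preserves the sequence shape
```

In a full ViT, the encoded CLS row (`encoded[0]` here) would be passed through an MLP head to produce the class prediction; that step is left out of this sketch.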