VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation

Zhijie Wang; Zhehua Zhou; Jiayang Song; Yuheng Huang; Zhan Shu; Lei Ma

TL;DR

VLATest is a generation-based fuzzing framework for Vision-Language-Action (VLA) models in robotic manipulation, implemented in ManiSkill2. The accompanying large-scale empirical study covers four tasks and seven VLA models. It reveals substantial robustness gaps: confounding objects, lighting, camera pose, unseen objects, and instruction mutations all significantly impair performance. Larger models are relatively more resilient, but none approaches deployment readiness. Key contributions include the testing framework, an 18,604-scene benchmark, and actionable insights into model scaling, prompting, data augmentation, and evaluation practices. The work emphasizes the need for comprehensive quality assurance of VLA models and provides artifacts to enable replication and further benchmarking, potentially guiding safer and more reliable AI-enabled robotic manipulation.

Abstract

The rapid advancement of generative AI and multi-modal foundation models has shown significant potential in advancing robotic manipulation. Vision-language-action (VLA) models, in particular, have emerged as a promising approach for visuomotor control by leveraging large-scale vision-language data and robot demonstrations. However, current VLA models are typically evaluated using a limited set of hand-crafted scenes, leaving their general performance and robustness in diverse scenarios largely unexplored. To address this gap, we present VLATest, a fuzzing framework designed to generate robotic manipulation scenes for testing VLA models. Based on VLATest, we conducted an empirical study to assess the performance of seven representative VLA models. Our study results revealed that current VLA models lack the robustness necessary for practical deployment. Additionally, we investigated the impact of various factors, including the number of confounding objects, lighting conditions, camera poses, unseen objects, and task instruction mutations, on the VLA model's performance. Our findings highlight the limitations of existing VLA models, emphasizing the need for further research to develop reliable and trustworthy VLA applications.
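To make the idea of generation-based scene fuzzing concrete, the following is a minimal sketch of how randomized test scenes might be sampled along the factors the abstract names (confounding objects, lighting, camera pose, task instruction). It is not the authors' implementation and does not use the ManiSkill2 API: the `SceneConfig` fields, the object list, the instruction templates, and all value ranges are hypothetical placeholders chosen for illustration only.

```python
import random
from dataclasses import dataclass

# Hypothetical object pool and instruction templates (not taken from the paper).
OBJECTS = ["can", "apple", "sponge", "spoon", "bottle"]
TEMPLATES = ["pick up the {obj}", "grasp the {obj}", "lift the {obj}"]


@dataclass
class SceneConfig:
    """One generated test scene; every field is an illustrative assumption."""
    target_object: str
    confounding_objects: list
    lighting_scale: float   # 1.0 means default lighting
    camera_offset: tuple    # (dx, dy, dz) perturbation of the default camera pose
    instruction: str


def generate_scene(rng, max_confounders=3, mutate_lighting=False, mutate_camera=False):
    """Sample a single scene configuration from the given random generator."""
    target = rng.choice(OBJECTS)
    others = [o for o in OBJECTS if o != target]
    # Confounders are distinct objects that are not the manipulation target.
    n_confounders = rng.randint(0, min(max_confounders, len(others)))
    confounders = rng.sample(others, n_confounders)
    lighting = rng.uniform(0.5, 2.0) if mutate_lighting else 1.0
    if mutate_camera:
        camera = tuple(rng.uniform(-0.05, 0.05) for _ in range(3))
    else:
        camera = (0.0, 0.0, 0.0)
    instruction = rng.choice(TEMPLATES).format(obj=target)
    return SceneConfig(target, confounders, lighting, camera, instruction)


def generate_suite(n_scenes, seed=0, **kwargs):
    """Generate a reproducible list of scenes from a fixed seed."""
    rng = random.Random(seed)
    return [generate_scene(rng, **kwargs) for _ in range(n_scenes)]
```

Calling `generate_suite(100, seed=0, mutate_lighting=True)` would yield 100 scene configurations that a simulator wrapper could then instantiate and run against a model. Fixing the seed keeps the suite reproducible, so different models can be compared on identical scenes.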

Paper Structure (25 sections, 1 equation, 8 figures, 8 tables, 1 algorithm)

Introduction
Background: Vision-Language-Action Models
VLA Model for Robotic Manipulation
Training and Evaluation of the VLA Model
VLATest
Operators
Testing Scene Generation
Empirical Study
Research Questions
Subject VLA Models
Robotic Manipulation Tasks
Prompt Templates
Implementation Details
Results
RQ1: How Do VLA Models Perform in Popular Robotic Manipulation Tasks?
...and 10 more sections

Figures (8)

Figure 1: Architecture and workflow of a VLA model. (a) Generating robot actions at the first timestamp. (b) Generating robot actions at the k-th timestamp.
Figure 2: Operators used for testing VLA models in VLATest.
Figure 3: (RQ2) VLA performance vs. the number of confounding objects.
Figure 4: (RQ2) The number of successful scenes at different steps vs. the number of confounding objects.
Figure 5: RT-1-X performing Task 1 with different numbers of confounding objects.
...and 3 more figures
