Eva-VLA: Evaluating Vision-Language-Action Models' Robustness Under Real-World Physical Variations

Hanqing Liu, Shouwei Ruan, Jiahuan Long, Junqi Wu, Jiacheng Hou, Huili Tang, Tingsong Jiang, Weien Zhou, Wen Yao

Abstract

Vision-Language-Action (VLA) models have emerged as promising solutions for robotic manipulation, yet their robustness to real-world physical variations remains critically underexplored. To bridge this gap, we propose Eva-VLA, the first unified framework to systematically evaluate the robustness of VLA models by formulating uncontrollable physical variations as continuous optimization problems. Specifically, our framework addresses two fundamental challenges in evaluating the physical robustness of VLA models: 1) how to systematically characterize the diverse physical perturbations encountered in real-world deployment while maintaining reproducibility, and 2) how to efficiently discover worst-case scenarios without incurring prohibitive real-world data collection costs. To tackle the first challenge, we decouple real-world variations into three key dimensions: 3D object transformations that affect spatial reasoning, illumination changes that challenge visual perception, and adversarial regions that disrupt scene understanding. For the second challenge, we introduce a continuous black-box optimization mechanism that maps these perturbations into a continuous parameter space, enabling the systematic exploration of worst-case scenarios. Extensive experiments validate the effectiveness of our approach. Notably, OpenVLA exhibits an average failure rate of over 90% across the three physical variations on the LIBERO-Long task, exposing critical systemic fragilities. Furthermore, applying the generated worst-case scenarios during adversarial training quantifiably increases model robustness. Our evaluation exposes the gap between laboratory and real-world conditions, and the Eva-VLA framework can also serve as an effective data augmentation method to enhance the resilience of robotic manipulation systems.

Paper Structure

This paper contains 14 sections, 4 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Visualization of three categories of physical variations. Object 3D transformations with rotation parameters ($\alpha$, $\beta$, $\gamma$) that alter object 3D poses in the scene (Top). Illumination variations modeled as Gaussian falloff functions with parameters ($x$, $y$, $\sigma$, $I$) controlling illumination position, radius, and intensity (Middle). Adversarial patches with translation parameters ($\Delta x$, $\Delta y$) that introduce visual disruptions at critical locations in the scene (Bottom).
  • Figure 2: Overview of the Proposed Eva-VLA Framework. To capture worst-case physical variations, discrete transformations in three critical domains are parameterized and their distributions optimized through a query-based method, maximizing prediction errors of vision-language-action models to reduce task success rates.
  • Figure 3: Qualitative results under three physical variations on OpenVLA-7B (Kim et al., 2024) fine-tuned on LIBERO (Liu et al., 2023). Three manipulation tasks are shown with original executions (top row) and adversarially perturbed executions (bottom row, highlighted in red) for each task. The 3D trajectory visualizations on the right show the end-effector paths before (green) and after (red) applying physical variations, illustrating how object 3D transformations (top), illumination variations (middle), and adversarial patches (bottom) respectively disrupt the robot's motion patterns and lead to task failures.
  • Figure 4: Ablation study on optimization steps. We evaluate task failure rates under three adversarial conditions across four LIBERO task suites (Spatial, Object, Goal, and Long). The x-axis represents optimization iterations (0-60), and the y-axis shows the failure rate. The overlaid lines demonstrate the increasing trend of failure rates as the optimization progresses.
  • Figure 5: Ablation study on the scale ratio of optimal distribution. We evaluate task failure rates under three adversarial conditions across four LIBERO task suites (Spatial, Object, Goal, and Long). The x-axis represents the scale ratio of optimal $\Sigma$, and the y-axis shows the failure rate.
  • ...and 2 more figures
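
The captions for Figures 1 and 2 describe two ideas that a short sketch may make more concrete: an illumination perturbation parameterized by ($x$, $y$, $\sigma$, $I$), and a query-based search over a distribution of perturbation parameters. The Python code below is only an illustration under stated assumptions, not the authors' implementation. The additive light model, the function names (`gaussian_light`, `apply_illumination`, `search_worst_case`), and the toy loss are all hypothetical. The search loop is a simple cross-entropy-method-style update swapped in for illustration, since this excerpt does not specify the optimizer Eva-VLA uses.

```python
import numpy as np


def gaussian_light(height, width, x, y, sigma, intensity):
    """Light map with Gaussian falloff: intensity * exp(-d^2 / (2 * sigma^2))."""
    rows, cols = np.mgrid[0:height, 0:width]
    dist_sq = (cols - x) ** 2 + (rows - y) ** 2
    return intensity * np.exp(-dist_sq / (2.0 * sigma ** 2))


def apply_illumination(image, x, y, sigma, intensity):
    """Add the light map to a float RGB image in [0, 1] and clip the result.

    Assumption: the light is additive; the paper may render it differently.
    """
    light = gaussian_light(image.shape[0], image.shape[1], x, y, sigma, intensity)
    return np.clip(image + light[..., None], 0.0, 1.0)


def search_worst_case(loss_fn, mean, std, steps=30, samples=16, top_k=4, seed=0):
    """Query-only search over a diagonal Gaussian of perturbation parameters.

    Each step samples candidates, queries loss_fn (treated as a black box),
    keeps the top_k highest-loss candidates, and refits mean and std to them.
    """
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    for _ in range(steps):
        candidates = rng.normal(mean, std, size=(samples, mean.size))
        losses = np.array([loss_fn(c) for c in candidates])
        elite = candidates[np.argsort(losses)[-top_k:]]
        # Small floor keeps the distribution from collapsing to a point.
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mean, std


# Toy usage: brighten a uniform gray image around its center.
image = np.full((64, 64, 3), 0.2)
lit = apply_illumination(image, x=32, y=32, sigma=8.0, intensity=0.5)

# Toy black-box loss standing in for "VLA prediction error": it peaks when
# the light center (x, y) sits at a hypothetical sensitive location.
sensitive_spot = np.array([40.0, 20.0])
toy_loss = lambda params: -np.sum((params - sensitive_spot) ** 2)
best_mean, best_std = search_worst_case(toy_loss, mean=[32.0, 32.0], std=[10.0, 10.0])
```

In an actual evaluation, `toy_loss` would be replaced by a rollout of the policy in simulation that returns a failure signal or action error. Only query access to the model is needed, which is what makes a sampling-based update of this kind applicable in a black-box setting.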