Experiences from Benchmarking Vision-Language-Action Models for Robotic Manipulation

Yihao Zhang; Yuankai Qi; Xi Zheng

TL;DR

This work presents the first systematic, real-world benchmark of Vision-Language-Action (VLA) models for dual-arm robotic manipulation across four tasks. It compares a specialist imitation policy (ACT) with three generalist VLA foundation models (OpenVLA-OFT, RDT-1B, π0) in both real-world and simulation settings, under in-distribution and out-of-distribution (OOD) conditions, using a standardized framework that measures accuracy, adaptability, and instruction grounding. π0 shows the strongest cross-domain generalization and robustness to distribution shift. ACT remains the most stable in-distribution but is brittle under OOD conditions, and OpenVLA-OFT and RDT-1B lag behind without task-specific tuning. The study also develops a detailed failure taxonomy and diagnostic workflow. The predominant error modes are near-miss grasps, improper release timing, and long-horizon state drift. The authors argue for tighter neuro-symbolic integration to improve interpretability and reliability in real-world robotic manipulation.

Abstract

Foundation models for robotics, particularly \textbf{Vision--Language--Action (VLA)} models, hold great promise for general-purpose manipulation. Yet systematic real-world evaluations and cross-model comparisons remain scarce. This paper reports our \textbf{empirical experiences} from benchmarking four representative models -- \textbf{ACT}, \textbf{OpenVLA--OFT}, \textbf{RDT-1B}, and $\boldsymbol{\pi}_0$ -- across four manipulation tasks conducted both in simulation and on the \textbf{ALOHA Mobile} platform. We establish a \textbf{standardized evaluation framework} that measures performance along three key dimensions: (1) \textit{accuracy and efficiency} (success rate and time-to-success), (2) \textit{adaptability} across in-distribution, spatial out-of-distribution, and instance-plus-spatial out-of-distribution settings, and (3) \textit{language instruction-following accuracy}. Through this process, we observe that $\boldsymbol{\pi}_0$ demonstrates superior adaptability in out-of-distribution scenarios, while \textbf{ACT} provides the highest stability in-distribution. Further analysis highlights differences in computational demands, data-scaling behavior, and recurring failure modes such as near-miss grasps, premature releases, and long-horizon state drift. These findings reveal practical trade-offs among VLA model architectures in balancing precision, generalization, and deployment cost, offering actionable insights for selecting and deploying VLAs in real-world robotic manipulation tasks.
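As an illustration of the first two evaluation dimensions, the sketch below aggregates per-trial outcomes into a success rate and a mean time-to-success for each (model, setting) pair. It is a minimal sketch, not the authors' evaluation code: the `Trial` record, the `summarize` function, the setting labels, and the placeholder trials are all assumptions introduced here, and the numbers are made up for demonstration only.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class Trial:
    """One evaluation rollout (hypothetical record layout)."""
    model: str
    setting: str               # e.g. "ID", "spatial-OOD", "instance+spatial-OOD"
    success: bool
    seconds: Optional[float]   # time-to-success; None when the trial failed


def summarize(trials):
    """Group trials by (model, setting) and compute the two headline metrics."""
    groups = defaultdict(list)
    for t in trials:
        groups[(t.model, t.setting)].append(t)

    summary = {}
    for key, group in groups.items():
        # Time-to-success is averaged over successful trials only.
        times = [t.seconds for t in group if t.success and t.seconds is not None]
        summary[key] = {
            "n": len(group),
            "success_rate": sum(t.success for t in group) / len(group),
            "mean_time_to_success": mean(times) if times else None,
        }
    return summary


if __name__ == "__main__":
    # Placeholder trials with a made-up model name; not results from the paper.
    demo = [
        Trial("model_a", "ID", True, 10.0),
        Trial("model_a", "ID", False, None),
        Trial("model_a", "ID", True, 14.0),
        Trial("model_a", "spatial-OOD", False, None),
    ]
    for (model, setting), stats in summarize(demo).items():
        print(model, setting, stats)
```

Averaging time-to-success over successful trials only keeps the two metrics separable: a model that rarely succeeds but is fast when it does is not mistaken for an efficient one, because its low success rate is reported alongside.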
