Experiences from Benchmarking Vision-Language-Action Models for Robotic Manipulation

Yihao Zhang, Yuankai Qi, Xi Zheng

TL;DR

This work presents the first systematic, real-world benchmark of Vision–Language–Action (VLA) models for dual-arm robotic manipulation across four tasks. It compares a specialist imitation policy (ACT) with three generalist VLA foundation models (OpenVLA–OFT, RDT-1B, π0) in both real-world and simulation settings under in-distribution and out-of-distribution conditions, using a standardized framework that measures accuracy, adaptability, and instruction grounding. Key findings show that π0 offers the strongest cross-domain generalization and robustness to distribution shifts, while ACT remains the most stable in-distribution but is brittle under OOD; OpenVLA–OFT and RDT-1B lag behind without task-specific tuning. The study also develops a detailed failure taxonomy and diagnostic workflow, revealing predominant error modes such as near-miss grasps, improper release timing, and long-horizon state drift, and argues for tighter neuro-symbolic integration to improve interpretability and reliability in real-world robotic manipulation.

Abstract

Foundation models applied in robotics, particularly Vision–Language–Action (VLA) models, hold great promise for achieving general-purpose manipulation. Yet systematic real-world evaluations and cross-model comparisons remain scarce. This paper reports our empirical experiences from benchmarking four representative VLAs (ACT, OpenVLA–OFT, RDT-1B, and π0) across four manipulation tasks conducted in both simulation and on the ALOHA Mobile platform. We establish a standardized evaluation framework that measures performance along three key dimensions: (1) accuracy and efficiency (success rate and time-to-success), (2) adaptability across in-distribution, spatial out-of-distribution, and instance-plus-spatial out-of-distribution settings, and (3) language instruction-following accuracy. Through this process, we observe that π0 demonstrates superior adaptability in out-of-distribution scenarios, while ACT provides the highest stability in-distribution. Further analysis highlights differences in computational demands, data-scaling behavior, and recurring failure modes such as near-miss grasps, premature releases, and long-horizon state drift. These findings reveal practical trade-offs among VLA model architectures in balancing precision, generalization, and deployment cost, offering actionable insights for selecting and deploying VLAs in real-world robotic manipulation tasks.
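
To make the first two evaluation dimensions concrete, the following is a minimal sketch of how success rate and time-to-success could be aggregated per model and per distribution setting. It is not the authors' evaluation code: the `Trial` record, the `summarize` function, and the setting labels are hypothetical names introduced here for illustration, and the sketch assumes time-to-success is averaged over successful trials only.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Trial:
    """One evaluation rollout (hypothetical record layout)."""
    model: str                       # e.g. "ACT"
    setting: str                     # e.g. "ID", "spatial-OOD", "instance+spatial-OOD"
    success: bool
    seconds: Optional[float] = None  # time-to-success; None when the trial failed


def summarize(trials):
    """Group trials by (model, setting) and compute the two metrics."""
    groups = defaultdict(list)
    for trial in trials:
        groups[(trial.model, trial.setting)].append(trial)

    summary = {}
    for key, group in groups.items():
        # Assumption: only successful trials contribute to time-to-success.
        times = [t.seconds for t in group if t.success and t.seconds is not None]
        summary[key] = {
            "trials": len(group),
            "success_rate": sum(t.success for t in group) / len(group),
            "mean_time_to_success": mean(times) if times else None,
        }
    return summary
```

Keying the summary on the (model, setting) pair keeps in-distribution and out-of-distribution results separate, so a drop in success rate between settings can be read off directly for each model.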

Paper Structure

This paper contains 39 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Training time on our setup (days): ACT ≈ 0.17, π0 ≈ 2, OpenVLA–OFT and RDT-1B ≈ 21 each.
  • Figure 2: Real-world manipulation tasks on ALOHA Mobile.
  • Figure 3: Simulation snapshot showing the robot arm manipulating colored cubes in MuJoCo.
  • Figure 4: Real-world case studies aligned with our failure taxonomy (a–h). Many scenes are difficult to attribute to a single root cause: ambiguities between perception precision, control timing, and state estimation make root-cause diagnosis non-trivial. We therefore complement qualitative video review with structured debugging (see the diagnosis subsection) and a taxonomy tree (see the failure taxonomy figure) that links each symptom to the sections where it is discussed.
  • Figure 5: Revised failure taxonomy in the original left-to-right layout: root at left, two branches for task-level symptoms and model-level factors.
  • ...and 2 more figures