LIBERO-X: Robustness Litmus for Vision-Language-Action Models

Guodong Wang, Chenkai Zhang, Qingjie Liu, Jinjin Zhang, Jiancheng Cai, Junjie Liu, Xinmin Liu

TL;DR

LIBERO-X presents a comprehensive benchmark for Vision-Language-Action models by coupling a hierarchical, multi-level evaluation with a high-diversity, teleoperation-based training dataset. The five-level test suite systematically increases spatial, topological, visual, and linguistic perturbations, while the multi-label diagnostics reveal detailed failure modes. Experimental results show significant performance drops as complexity grows, exposing persistent gaps in scene understanding and instruction grounding across representative VLA architectures. The framework offers a more faithful assessment of real-world robustness and provides practical directions for improving generalization and long-horizon planning in robotic manipulation.

Abstract

Reliable benchmarking is critical for advancing Vision-Language-Action (VLA) models, as it reveals their generalization, robustness, and alignment of perception with language-driven manipulation tasks. However, existing benchmarks often provide limited or misleading assessments due to insufficient evaluation protocols that inadequately capture real-world distribution shifts. This work systematically rethinks VLA benchmarking from both evaluation and data perspectives, introducing LIBERO-X, a benchmark featuring: 1) A hierarchical evaluation protocol with progressive difficulty levels targeting three core capabilities: spatial generalization, object recognition, and task instruction understanding. This design enables fine-grained analysis of performance degradation under increasing environmental and task complexity; 2) A high-diversity training dataset collected via human teleoperation, where each scene supports multiple fine-grained manipulation objectives to bridge the train-evaluation distribution gap. Experiments with representative VLA models reveal significant performance drops under cumulative perturbations, exposing persistent limitations in scene comprehension and instruction grounding. By integrating hierarchical evaluation with diverse training data, LIBERO-X offers a more reliable foundation for assessing and advancing VLA development.

Paper Structure (21 sections, 14 figures, 13 tables)

Figures (14)

  • Figure 1: Comparison with related simulation benchmarks. (a): LIBERO lacks diversity by coupling scenes with homogeneous trajectories, inducing action overfitting, while its test set closely mirrors training data. (b): Recent extensions reuse the original training trajectories and introduce perturbations but model them independently, failing to capture complex distribution shifts. (c): LIBERO-X enhances training via multi-task scenes with diverse trajectories. Its multi-level evaluation protocol progressively escalates complexity, revealing significant performance degradation across VLA models as difficulty increases, thus enabling rigorous robustness assessment.
  • Figure 2: Overview of LIBERO-X. LIBERO-X provides a high-diversity training dataset constructed through human teleoperation, along with multi-level and multi-label evaluation data.
  • Figure 3: Distribution visualization of training trajectories.
  • Figure 4: Practical examples from the LIBERO-X training dataset.
  • Figure 5: Success Rate Decline across Multi-level Evaluation.
  • ...and 9 more figures