LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization

Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, Lichao Sun

TL;DR

This work identifies fundamental flaws in LIBERO's evaluation protocol, where high scores largely reflect memorization rather than genuine generalization. It proposes LIBERO-PRO, a perturbation-based benchmark that stress-tests VLA models across manipulated objects, initial states, instructions, and environments, with randomized combinations and clear constraints. Through experiments on OpenVLA, pi0, and pi0.5, LIBERO-PRO reveals pronounced robustness gaps and demonstrates that standard LIBERO metrics do not reflect real-world capabilities. The authors advocate adopting LIBERO-PRO for fairer, more reliable evaluation and provide the code at https://github.com/Zxy-MLlab/LIBERO-PRO.

Abstract

LIBERO has emerged as a widely adopted benchmark for evaluating Vision-Language-Action (VLA) models; however, its current training and evaluation settings are problematic, often leading to inflated performance estimates and preventing fair model comparison. To address these issues, we introduce LIBERO-PRO, an extended LIBERO benchmark that systematically evaluates model performance under reasonable perturbations across four dimensions: manipulated objects, initial states, task instructions, and environments. Experimental results reveal that, although existing models achieve over 90% accuracy under the standard LIBERO evaluation, their performance collapses to 0.0% under our generalized setting. Crucially, this discrepancy exposes the models' reliance on rote memorization of action sequences and environment layouts from the training set, rather than genuine task understanding or environmental perception. For instance, models persist in executing grasping actions when the target object is replaced with irrelevant items, and their outputs remain unchanged even when given corrupted instructions or messy tokens. These findings reveal severe flaws in current evaluation practices, and we call on the community to abandon misleading methodologies in favor of robust assessments of model generalization and comprehension. Our code is available at: https://github.com/Zxy-MLlab/LIBERO-PRO.
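To make the evaluation idea concrete, the following is a minimal sketch of how success rates could be tabulated over combinations of the four perturbation dimensions. It is not taken from the LIBERO-PRO repository. The names `perturbation_settings`, `success_rate`, `evaluate`, and the `run_episode(setting, seed)` callback are hypothetical, and enumerating every subset of dimensions is an assumption made for illustration. The toy policy at the end is a stand-in that mimics pure memorization; it is not a real model and produces no real results.

```python
from itertools import combinations
import random

# The four perturbation dimensions named in the abstract.
DIMENSIONS = ("object", "initial_state", "instruction", "environment")


def perturbation_settings():
    """Yield every subset of DIMENSIONS as a tuple.

    The empty tuple stands for the unperturbed (standard) setting.
    Enumerating all subsets is an assumption of this sketch.
    """
    for k in range(len(DIMENSIONS) + 1):
        for combo in combinations(DIMENSIONS, k):
            yield combo


def success_rate(outcomes):
    """Return the fraction of successful episodes in a list of booleans."""
    if not outcomes:
        raise ValueError("success_rate needs at least one episode")
    return sum(outcomes) / len(outcomes)


def evaluate(run_episode, settings, episodes=50, seed=0):
    """Run `episodes` rollouts per setting and return {setting: success rate}.

    `run_episode(setting, episode_seed) -> bool` is a hypothetical callback
    that rolls out a policy under the given perturbations.
    """
    rng = random.Random(seed)
    results = {}
    for setting in settings:
        outcomes = [
            bool(run_episode(setting, rng.randrange(2**31)))
            for _ in range(episodes)
        ]
        results[setting] = success_rate(outcomes)
    return results


def memorizing_policy(setting, episode_seed):
    """Toy stand-in: succeeds only when nothing is perturbed."""
    return len(setting) == 0


if __name__ == "__main__":
    table = evaluate(memorizing_policy, list(perturbation_settings()), episodes=5)
    for setting, rate in table.items():
        label = "+".join(setting) or "standard"
        print(f"{label}: {rate:.2f}")
```

Reporting one success rate per setting, rather than a single aggregate, keeps the gap between the standard and perturbed conditions visible.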

Paper Structure

This paper contains 17 sections, 11 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Model trajectory consistency under object and instruction modifications. “Original task” denotes the standard evaluation, while “Replace,” “Change Pos.,” “Remove,” and “Messy instruction” introduce perturbations to the target object or instruction. Across all settings, the model yields nearly identical trajectories, suggesting a lack of genuine task understanding and environmental perception.
  • Figure 2: Success rates of OpenVLA, pi0, pi0.5, and UniVLA under object position perturbations.
  • Figure 3: LIBERO-PRO benchmark task overview. We extend the four original task categories in LIBERO by introducing object attribute perturbations, initial position perturbations, instruction perturbations, and environmental perturbations.
  • Figure 4: Pick up salad dressing and place it in basket.
  • Figure 5: Pick up salad dressing and place it in basket.
  • ...and 2 more figures