LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models

Chanyoung Kim, Minwoo Kim, Minseok Kang, Hyunwoo Kim, Dahuin Jung

Abstract

Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in downstream robotic settings, they are typically fine-tuned with limited data, leading to overfitting to specific instruction formulations and leaving robustness to paraphrased instructions underexplored. To study this gap, we introduce LIBERO-Para, a controlled benchmark that independently varies action expressions and object references for fine-grained analysis of linguistic generalization. Across seven VLA configurations (0.6B-7.5B parameters), we observe consistent performance degradation of 22-52 percentage points under paraphrasing. This degradation is primarily driven by object-level lexical variation: even simple synonym substitutions cause large drops, indicating reliance on surface-level matching rather than semantic grounding. Moreover, 80-96% of failures arise from planning-level trajectory divergence rather than execution errors, showing that paraphrasing disrupts task identification. Binary success rate treats all paraphrases equally, obscuring whether models perform consistently across difficulty levels or rely on easier cases. To address this, we propose PRIDE, a metric that quantifies paraphrase difficulty using semantic and syntactic factors. Our benchmark and corresponding code are available at: https://github.com/cau-hai-lab/LIBERO-Para

Figures (29)

  • Figure 1: Illustration of the paraphrase robustness gap under data-scarce fine-tuning: VLA models can overfit to seen instruction phrasings during fine-tuning and fail to generalize to paraphrased variants at deployment.
  • Figure 2: Overview of LIBERO-Para. Compared to LIBERO, LIBERO-Para evaluates paraphrase robustness under data-scarce fine-tuning via a controlled two-axis design (action vs. object), enabling interpretable analysis.
  • Figure 3: Examples of axis-specific paraphrases. Object variations modify target object references (e.g., same-polarity substitution, addition), while action variations cover lexical, structural, and pragmatic realizations grounded in established taxonomies.
  • Figure 4: $S_K$ (top) and $S_T$ (bottom) computation. $S_K$ is based on semantic matching between task-critical content words, while $S_T$ uses dependency-tree edit distance (a code sketch follows this list). Node colors indicate dependency relations: root (sentence root), dobj (direct object), pobj (object of preposition), and others (remaining types, simplified for visualization; all included in computation).
  • Figure 5: Average PRIDE score per Object × Action cell in LIBERO-Para (darker = harder). Scores increase along both axes, with the most indirect action types (Question, Hint) combined with object paraphrasing reaching the highest score (SP-habitual × Question: 0.42).
  • ...and 24 more figures
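
To make the two factors in Figure 4 concrete, here is a minimal sketch of how they could be approximated with off-the-shelf tooling: spaCy for part-of-speech tags and dependency parses, and the `zss` package for Zhang-Shasha tree edit distance. The function names (`s_k`, `s_t`, `to_zss`), the greedy best-match scoring, and the length normalization are illustrative assumptions; the paper defines the exact formulation of $S_K$, $S_T$, and how PRIDE combines them.

```python
# Minimal sketch (not the paper's implementation) of the two PRIDE factors:
#   S_K -- semantic matching between task-critical content words,
#   S_T -- dependency-tree edit distance between the two instructions.
# Requires: pip install spacy zss && python -m spacy download en_core_web_md
import spacy
from zss import Node, simple_distance

nlp = spacy.load("en_core_web_md")  # md/lg models ship word vectors

CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ"}  # assumed "content word" set

def s_k(orig: str, para: str) -> float:
    """Average best-match vector similarity over content words (assumed scoring)."""
    a = [t for t in nlp(orig) if t.pos_ in CONTENT_POS]
    b = [t for t in nlp(para) if t.pos_ in CONTENT_POS]
    if not a or not b:
        return 0.0
    return sum(max(t.similarity(u) for u in b) for t in a) / len(a)

def to_zss(token) -> Node:
    """Convert a spaCy dependency subtree into a zss tree labeled by relation."""
    node = Node(token.dep_)
    for child in token.children:
        node.addkid(to_zss(child))
    return node

def s_t(orig: str, para: str) -> float:
    """Tree edit distance between dependency parses, roughly length-normalized."""
    root_a = next(s.root for s in nlp(orig).sents)
    root_b = next(s.root for s in nlp(para).sents)
    dist = simple_distance(to_zss(root_a), to_zss(root_b))
    return dist / max(len(nlp(orig)), len(nlp(para)))

if __name__ == "__main__":
    base = "pick up the black bowl and place it on the plate"
    para = "could you grab the dark bowl and set it on the plate?"
    print(f"S_K = {s_k(base, para):.3f}, S_T = {s_t(base, para):.3f}")
```

Labeling the zss nodes by dependency relation mirrors the relation-colored trees in Figure 4; an implementation could also label nodes by (relation, lemma) pairs so that lexical substitutions contribute to the structural term as well.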