Bring the Apple, Not the Sofa: Impact of Irrelevant Context in Embodied AI Commands on VLA Models

Daria Pugacheva, Andrey Moskalenko, Denis Shepelev, Andrey Kuznetsov, Vlad Shakhuro, Elena Tutubalina

TL;DR

This work systematically probes the robustness of Vision-Language-Action (VLA) models to linguistic perturbations, focusing on irrelevant context and natural paraphrasing. It introduces a diverse perturbation framework (context-length variations and semantic/lexical proximity) and evaluates five VLA models across LIBERO and Habitat 2.0, revealing substantial degradation as noise increases, with semantically similar noise causing the largest drops. A key contribution is an LLM-based filtering framework that extracts core commands from noisy inputs, achieving up to 98.5% recovery of original performance and substantial gains across benchmarks. The findings emphasize the gap between real-world language variability and current VLA training regimes, and offer a practical mitigation path for safer, more reliable embodied agents in realistic settings.

Abstract

Vision-Language-Action (VLA) models are widely used in Embodied AI, enabling robots to interpret and execute language instructions. However, their robustness to natural language variability in real-world scenarios has not been thoroughly investigated. In this work, we present a novel systematic study of the robustness of state-of-the-art VLA models under linguistic perturbations. Specifically, we evaluate model performance under two types of instruction noise: (1) human-generated paraphrasing and (2) the addition of irrelevant context. We further categorize irrelevant contexts along two dimensions: their length and their semantic and lexical proximity to robot commands. We observe consistent performance degradation as context size expands. We also demonstrate that models can be relatively robust to random context, with a performance drop within 10%, while semantically and lexically similar context of the same length can trigger a quality decline of around 50%. Human paraphrases of instructions lead to a drop of nearly 20%. To mitigate this, we propose an LLM-based filtering framework that extracts core commands from noisy inputs. Incorporating our filtering step allows models to recover up to 98.5% of their original performance under noisy conditions.
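The filtering idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the few-shot examples, the function names (`build_filter_prompt`, `filter_instruction`, `recovery_ratio`), and the definition of recovery as filtered success rate divided by clean success rate are all assumptions made for illustration. The LLM is passed in as a plain callable so that no particular model API is implied.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical few-shot pairs (noisy input -> core command); not taken from the paper.
FEW_SHOT_EXAMPLES: Sequence[Tuple[str, str]] = (
    ("The weather is awful today. Bring the apple, the sofa looks old.",
     "bring the apple"),
    ("I was thinking about lunch, anyway, put the bowl on the plate please.",
     "put the bowl on the plate"),
)


def build_filter_prompt(noisy_instruction: str,
                        examples: Sequence[Tuple[str, str]] = FEW_SHOT_EXAMPLES) -> str:
    """Build a few-shot prompt asking an LLM to extract the core robot command."""
    lines = ["Extract only the robot command from the input. Drop everything else."]
    for noisy, clean in examples:
        lines.append(f"Input: {noisy}")
        lines.append(f"Command: {clean}")
    lines.append(f"Input: {noisy_instruction}")
    lines.append("Command:")
    return "\n".join(lines)


def filter_instruction(noisy_instruction: str, llm: Callable[[str], str]) -> str:
    """Return the command extracted by `llm`, or the original text if the reply is empty."""
    reply = llm(build_filter_prompt(noisy_instruction)).strip()
    if not reply:
        # Fall back to the unfiltered instruction rather than sending nothing to the VLA model.
        return noisy_instruction
    # Keep only the first line in case the LLM adds explanations after the command.
    return reply.splitlines()[0].strip()


def recovery_ratio(filtered_success: float, clean_success: float) -> float:
    """Assumed metric: success rate after filtering relative to the clean-instruction rate."""
    if clean_success <= 0:
        raise ValueError("clean_success must be positive")
    return filtered_success / clean_success
```

The filtered string would then be passed to the VLA policy in place of the raw user utterance. Keeping the LLM behind a callable makes it easy to swap models of different sizes when comparing filtering quality.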

Paper Structure

This paper contains 24 sections, 12 figures, 11 tables.

Figures (12)

  • Figure 1: Human-voiced commands to the robot may contain irrelevant context and cause the target command to fail. We observed a significant drop in the success rates of VLA robotic models when tasks were posed by real users.
  • Figure 2: The instruction that was shown to workers during crowdsourcing.
  • Figure 4: Success rates for LLARP in the Habitat 2.0 simulator for commands with different types of irrelevant context after filtering by LLMs of various sizes using a few-shot prompt.
  • Figure 5: Ratio of recovered commands from the LIBERO benchmark, averaged across task suites and all types of irrelevant context.
  • Figure 6: Examples of instructions with 3 different types of irrelevant context used in the filtering framework.
  • ...and 7 more figures