From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models

Irving Fang, Juexiao Zhang, Shengbang Tong, Chen Feng

TL;DR

This work introduces INT-ACT, a 50-task, simulation-based probing suite to systematically evaluate Vision-Language-Action models across language complexity, object diversity, and vision-language reasoning. By benchmarking four prominent VLA architectures on BridgeV2 within INT-ACT, the authors uncover a robust Intention-Action Gap: VLAs often display correct high-level intentions under distribution shifts but falter in precise motor execution. The study also shows that fine-tuning can erode the underlying VLM's generalist capabilities, and that language perturbations and multimodal distractions degrade performance, indicating fragile multimodal generalization. The authors provide open-source task suites and evaluation code to standardize future VLA research and encourage methods that close the perception-to-action gap, with implications for designing more robust embodied AI systems.

Abstract

One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, "generalist" robot policies. However, current evaluations of VLAs remain insufficient. Traditional imitation learning benchmarks are unsuitable due to the lack of language instructions. Emerging benchmarks for VLAs that incorporate language often come with limited evaluation tasks and do not intend to investigate how much VLM pretraining truly contributes to the generalization capabilities of the downstream robotic policy. Meanwhile, much research relies on real-world robot setups designed in isolation by different institutions, which creates a barrier for reproducibility and accessibility. To address this gap, we introduce a unified probing suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects. We systematically evaluate several state-of-the-art VLA architectures on this suite to understand their generalization capability. Our results show that while VLM backbones endow VLAs with robust perceptual understanding and high-level planning, which we refer to as good intentions, this does not reliably translate into precise motor execution: when faced with out-of-distribution observations, policies often exhibit coherent intentions, but falter in action execution. Moreover, finetuning on action data can erode the original VLM's generalist reasoning abilities. We release our task suite and evaluation code to serve as a standardized benchmark for future VLAs and to drive research on closing the perception-to-action gap. More information, including the source code, can be found at https://ai4ce.github.io/INT-ACT/

Paper Structure

This paper contains 34 sections, 7 figures, 15 tables.

Figures (7)

  • Figure 1: Left: Examples of tasks with out-of-distribution objects. Right: Examples of tasks with commonsense reasoning, distractors, and commonsense reasoning + distractors.
  • Figure 2: Illustration of the INT-ACT probing suite. Left: suite category breakdown. Right: illustration of language variations.
  • Figure 3: The Intention-Action Gap illustrated by comparing the two radar maps. Task Success Radar on the left, Intention Correctness Radar on the right. Best viewed in color.
  • Figure 4: OOD generalization results. Out-of-Distribution objects are painted in orange.
  • Figure 5: Case studies showing the impact of visual distractions and language commonsense variations. Task illustrations are on the left. Success Rate, Intention Correctness, and Wrong Object Attempt Rate are grouped by model on the right.
  • ...and 2 more figures