Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization

Jiaming Zhou, Ke Ye, Jiayi Liu, Teli Ma, Zifan Wang, Ronghe Qiu, Kun-Yu Lin, Zhilin Zhao, Junwei Liang

TL;DR

This work targets zero-shot cross-task generalization in vision-language-action robotic manipulation by introducing AGNOSTOS, an RLBench-based benchmark with 23 unseen tasks across two difficulty levels. To close the generalization gap, it proposes Cross-Task In-Context Manipulation (X-ICM), which uses a diffusion-based dynamics model to select in-context demonstrations from seen tasks and then prompts LLMs with them to predict actions for unseen tasks. On AGNOSTOS, X-ICM substantially improves cross-task zero-shot performance over leading VLA approaches on both Level-1 and Level-2 tasks, and it is also tested zero-shot on five real-world tasks. Challenges remain for novel semantics and long-horizon tasks, motivating further work on multi-modal reasoning and embodiment-agnostic generalization for open-world robotic manipulation.

Abstract

The generalization capabilities of vision-language-action (VLA) models to unseen tasks are crucial to achieving general-purpose robotic manipulation in open-world settings. However, the cross-task generalization capabilities of existing VLA models remain significantly underexplored. To address this gap, we introduce AGNOSTOS, a novel simulation benchmark designed to rigorously evaluate cross-task zero-shot generalization in manipulation. AGNOSTOS comprises 23 unseen manipulation tasks for testing, distinct from common training task distributions, and incorporates two levels of generalization difficulty to assess robustness. Our systematic evaluation reveals that current VLA models, despite being trained on diverse datasets, struggle to generalize effectively to these unseen tasks. To overcome this limitation, we propose Cross-Task In-Context Manipulation (X-ICM), a method that conditions large language models (LLMs) on in-context demonstrations from seen tasks to predict action sequences for unseen tasks. Additionally, we introduce a dynamics-guided sample selection strategy that identifies relevant demonstrations by capturing cross-task dynamics. On AGNOSTOS, X-ICM significantly improves cross-task zero-shot generalization performance over leading VLAs. We believe AGNOSTOS and X-ICM will serve as valuable tools for advancing general-purpose robotic manipulation.
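
The two stages named in the abstract, selecting relevant seen-task demonstrations and then conditioning an LLM on them, can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the function names `select_demonstrations` and `build_prompt`, the use of cosine similarity over precomputed feature vectors, and the text layout of the prompt are hypothetical and are not taken from the paper, where the features would come from a learned dynamics model.

```python
import numpy as np


def select_demonstrations(query_feat, seen_feats, k=2):
    """Return indices of the k seen-task demos closest to the query.

    Assumption: each demonstration is already summarized by one feature
    vector, and relevance is measured by cosine similarity.
    """
    query = np.asarray(query_feat, dtype=float)
    seen = np.asarray(seen_feats, dtype=float)
    query = query / np.linalg.norm(query)
    seen = seen / np.linalg.norm(seen, axis=1, keepdims=True)
    sims = seen @ query                      # cosine similarity per demo
    return np.argsort(-sims)[:k].tolist()    # highest similarity first


def build_prompt(demos, query_instruction):
    """Concatenate (instruction, actions) pairs, then append the unseen task.

    The prompt layout is hypothetical; it only shows the in-context pattern.
    """
    parts = []
    for instruction, actions in demos:
        parts.append(f"Task: {instruction}\nActions: {actions}\n")
    parts.append(f"Task: {query_instruction}\nActions:")
    return "\n".join(parts)


# Toy data: three seen-task demos with 2-D stand-in features.
seen_feats = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
seen_demos = [
    ("close the jar", "[0.1, 0.2, 0.3, grip]"),
    ("sweep the dirt", "[0.5, 0.1, 0.2, open]"),
    ("open the drawer", "[0.2, 0.2, 0.3, grip]"),
]

selected = select_demonstrations([1.0, 0.0], seen_feats, k=2)
prompt = build_prompt([seen_demos[i] for i in selected], "close the box")
print(selected)  # [0, 2]
print(prompt)
```

In this toy setup the retrieved demonstrations are the two whose stand-in features point in nearly the same direction as the query, and the resulting prompt ends with the unseen task's instruction followed by an open `Actions:` slot for the LLM to complete.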

Paper Structure

This paper contains 24 sections, 5 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: The proposed AGNOSTOS benchmark evaluates zero-shot cross-task generalization through two difficulty levels. Level-1 testing involves 13 unseen tasks sharing partial similarity (objects or motions) with seen tasks. Level-2 testing has 10 unseen tasks from entirely novel scenarios, requiring stronger generalization capabilities. We systematically assess three broad categories of vision-language-action models, revealing critical limits in their ability to adapt to unseen tasks.
  • Figure 2: X-ICM Method Overview. X-ICM employs a dynamics-guided sample selection module to retrieve effective demonstrations from seen tasks for each tested unseen task. These demonstrations are then used by the cross-task in-context prediction module to construct the prompt that drives the LLM to predict the corresponding action sequence.
  • Figure 3: Effects of dynamics-guided sample selection module and different model sizes.
  • Figure 4: Results of five real-world tasks. The tests are conducted in a zero-shot cross-task manner.
  • Figure A1: Examples of 18 widely used training (seen) tasks on RLBench.
  • ...and 6 more figures