VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, Xipeng Qiu

TL;DR

VLABench introduces a large-scale, open-source benchmark for language-conditioned robotics manipulation that targets long-horizon reasoning and world-knowledge transfer. It defines 100 task categories (60 primitive, 40 composite) and over 2,000 objects, and it evaluates VLAs, hybrid workflows, and VLMs, using automated data generation via a DSL and domain randomization in a MuJoCo-based simulator. The work reveals substantial gaps in the ability of current VLAs and VLMs to generalize, plan, and reason in embodied settings, while offering a scalable framework and metrics (including Progress Score and PM) to drive future progress. By providing standardized datasets, an instruction-augmentation pipeline, and cross-embodiment assets, VLABench aims to accelerate research on robust, general-purpose language-conditioned robotic manipulation.

Abstract

General-purpose embodied agents are designed to understand users' natural instructions or intentions and act precisely to complete universal tasks. Recently, methods based on foundation models, especially Vision-Language-Action models (VLAs), have shown substantial potential for solving language-conditioned manipulation (LCM) tasks. However, existing benchmarks do not adequately meet the needs of VLAs and related algorithms. To better define such general-purpose tasks in the context of LLMs and advance research on VLAs, we present VLABench, an open-source benchmark for evaluating universal LCM task learning. VLABench provides 100 carefully designed task categories, with strong randomization within each category and a total of 2000+ objects. VLABench stands out from previous benchmarks in four key aspects: 1) tasks requiring world knowledge and common-sense transfer, 2) natural language instructions with implicit human intentions rather than templates, 3) long-horizon tasks demanding multi-step reasoning, and 4) evaluation of both action policies and language model capabilities. The benchmark assesses multiple competencies, including understanding of mesh and texture, spatial relationships, semantic instructions, physical laws, knowledge transfer, and reasoning. To support downstream finetuning, we provide high-quality training data collected via an automated framework incorporating heuristic skills and prior information. The experimental results indicate that both current state-of-the-art pretrained VLAs and workflows based on VLMs face challenges in our tasks.
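The "strong randomization in each category of task" mentioned in the abstract can be illustrated with a minimal sketch. This is not VLABench's actual API or asset set: the names `EpisodeConfig`, `sample_episode`, `sample_episodes`, the object and texture pools, and the position bounds are all hypothetical, chosen only to show how a seeded sampler could vary an episode's target object, placement, and surface texture.

```python
import random
from dataclasses import dataclass

# Hypothetical asset pools for illustration; these are NOT VLABench assets.
OBJECTS = ["apple", "mug", "book"]
TEXTURES = ["wood", "marble", "cloth"]

# Hypothetical table-top bounds in metres (assumed, not from the paper).
X_RANGE = (-0.3, 0.3)
Y_RANGE = (-0.2, 0.2)


@dataclass(frozen=True)
class EpisodeConfig:
    """One randomized episode configuration (illustrative only)."""
    target_object: str
    position_xy: tuple[float, float]
    texture: str


def sample_episode(rng: random.Random) -> EpisodeConfig:
    """Draw the object, its planar position, and the table texture at random."""
    return EpisodeConfig(
        target_object=rng.choice(OBJECTS),
        position_xy=(rng.uniform(*X_RANGE), rng.uniform(*Y_RANGE)),
        texture=rng.choice(TEXTURES),
    )


def sample_episodes(seed: int, n: int) -> list[EpisodeConfig]:
    """Generate n episode configurations reproducibly from a single seed."""
    rng = random.Random(seed)
    return [sample_episode(rng) for _ in range(n)]


if __name__ == "__main__":
    for cfg in sample_episodes(seed=0, n=3):
        print(cfg)
```

Seeding a dedicated `random.Random` instance, rather than the module-level generator, keeps evaluation episodes reproducible across runs while still varying every sampled attribute.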

Paper Structure

This paper contains 33 sections, 6 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: Overview of VLABench. VLABench is a large-scale language-conditioned manipulation benchmark to evaluate the comprehensive skill learning and generalization ability of action policies especially pre-trained vision-language-action models.
  • Figure 2: Long horizon task requiring reasoning. This task involves a request for a latte in an interactive scenario. The agent needs to recognize the requirement for coffee with milk and integrate multiple skills, including picking, placing, tool use, pressing, and pouring.
  • Figure 3: Task examples in each dimension. The first row showcases examples of primitive tasks from the task description section, while the second row presents examples of composite tasks.
  • Figure 4: Evaluation results for Voxposer and CoPA. Voxposer w/o refers to the version without visual perception, where ground truth labels are directly provided for object selection. Voxposer w uses GPT-4V as the visual perception module.
  • Figure 5: Radar charts depicting the performance of all VLM models across six dimensions. Only GLM-4V-9B is evaluated in a zero-shot setting, because it does not support the multi-image inference required by the setting used for the other models.
  • ...and 13 more figures