Rethinking the Practicality of Vision-language-action Model: A Comprehensive Benchmark and An Improved Baseline

Wenxuan Song, Jiayi Chen, Xiaoquan Sun, Huashuo Lei, Yikai Qin, Wei Zhao, Pengxiang Ding, Han Zhao, Tongxin Wang, Pengxu Hou, Zhide Zhong, Haodong Yan, Donglin Wang, Jun Ma, Haoang Li

TL;DR

This paper introduces LLaVA-VLA, a lightweight yet powerful VLA designed for practical deployment on consumer-grade GPUs. It adopts a two-stage training paradigm (post-training followed by fine-tuning) and extends the action space to unify navigation and manipulation.

Abstract

Vision-Language-Action (VLA) models have emerged as generalist robotic agents. However, existing VLAs are hindered by excessive parameter scales, prohibitive pre-training requirements, and limited applicability to diverse embodiments. To improve the practicality of VLAs, we propose a comprehensive benchmark and an improved baseline. First, we propose CEBench, a new benchmark that spans diverse embodiments in both simulation and the real world and accounts for domain randomization. We collect 14.4k simulated trajectories and 1.6k real-world expert-curated trajectories to support training on CEBench. Second, using CEBench as our testbed, we study three critical aspects of VLAs' practicality and offer several key findings. Informed by these findings, we introduce LLaVA-VLA, a lightweight yet powerful VLA designed for practical deployment on consumer-grade GPUs. Architecturally, it integrates a compact VLM backbone with multi-view perception, proprioceptive tokenization, and action chunking. To eliminate reliance on costly pre-training, LLaVA-VLA adopts a two-stage training paradigm consisting of post-training and fine-tuning. Furthermore, LLaVA-VLA extends the action space to unify navigation and manipulation. Experiments across embodiments demonstrate the generalization and versatility of LLaVA-VLA, while real-world mobile manipulation experiments establish it as the first end-to-end VLA model for mobile manipulation. We will open-source all datasets, code, and checkpoints upon acceptance to foster reproducibility and future research.
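The abstract names three mechanisms without detailing them: proprioceptive tokenization, action chunking, and an action space that covers both navigation and manipulation. The following is a minimal sketch of how such pieces are commonly built, not the authors' implementation. The uniform binning scheme, the 256-bin count, the [-1, 1] value range, the padding rule, and every function name here are assumptions made purely for illustration.

```python
from typing import List, Sequence

NUM_BINS = 256  # assumed bin count; the abstract does not specify one


def tokenize(values: Sequence[float], low: float = -1.0, high: float = 1.0,
             num_bins: int = NUM_BINS) -> List[int]:
    """Map each continuous value to a bin index in [0, num_bins - 1]."""
    tokens = []
    for v in values:
        clipped = min(max(v, low), high)          # clamp to the assumed range
        frac = (clipped - low) / (high - low)     # normalize to [0, 1]
        tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens


def detokenize(tokens: Sequence[int], low: float = -1.0, high: float = 1.0,
               num_bins: int = NUM_BINS) -> List[float]:
    """Map bin indices back to the center value of each bin."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]


def unified_action(arm: Sequence[float], base: Sequence[float]) -> List[float]:
    """Concatenate manipulation (arm) and navigation (base) commands
    into a single action vector (hypothetical layout)."""
    return list(arm) + list(base)


def chunk_actions(trajectory: List[List[float]],
                  chunk_size: int) -> List[List[List[float]]]:
    """For each timestep t, collect the next chunk_size actions.
    Chunks near the end are padded by repeating the final action."""
    chunks = []
    for t in range(len(trajectory)):
        chunk = trajectory[t:t + chunk_size]
        chunk = chunk + [trajectory[-1]] * (chunk_size - len(chunk))
        chunks.append(chunk)
    return chunks
```

In this sketch the same `tokenize` function could serve both the proprioceptive state and the action vector, so that a language-model backbone sees and emits discrete tokens; predicting a whole chunk per inference step reduces how often the model must be queried during control.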

Paper Structure

This paper contains 20 sections, 5 figures, and 11 tables.

Figures (5)

  • Figure 1: Overview of this work. We conduct a comprehensive study on the practicality of vision-language-action models. We first construct a cross-embodiment benchmark, CEBench, across simulation and the real world, and offer diverse evaluation settings. Then we explore three critical aspects (Q1-Q3) and offer several key findings. Based on the above findings, we introduce our LLaVA-VLA, a lightweight yet effective baseline capable of mobile manipulation.
  • Figure 2: Real-world setup of the Cobot-Magic system for mobile bimanual manipulation (top view).
  • Figure 3: Model architecture of our LLaVA-VLA.
  • Figure 4: Visualization of real-world tasks. The top two rows illustrate the seen tasks, while the bottom two rows correspond to settings with domain randomization. Out of the eight real-world tasks, we select two representative examples of single-arm manipulation (left) as well as two examples of bimanual collaboration (right).
  • Figure 5: Evaluation in real-world mobile manipulation tasks.