VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers

Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao-Shu Fang, Tong He

TL;DR

This work tackles the scalability and efficiency of Vision-Language-Action (VLA) models by introducing a convolutional residual VQ-VAE as a general action tokenizer. Trained on over 100 times more data than prior approaches and integrated into the VLA with layer-offset token IDs, the tokenizer enables longer action sequences, faster inference, and improved long-horizon planning. The approach yields strong gains on real-world robotic tasks and shows only a small sim-to-real gap, which the authors attribute to extensive synthetic data and progressive training. Overall, the method improves both the effectiveness and the practicality of embodied intelligence systems in simulated and real environments.
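To make the phrase "residual VQ with layer-offset token IDs" concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each residual layer has its own codebook of size K and that layer k's indices are shifted by k * K (plus a hypothetical `base_id`) so that tokens from different layers never collide in the VLA vocabulary. The toy codebooks, `base_id`, and all function names are illustrative assumptions.

```python
def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(latent, codebooks):
    """Residual VQ: each layer quantizes what the previous layers left over."""
    residual = list(latent)
    indices = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected code vectors from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def to_token_ids(indices, codebook_size, base_id=0):
    """Layer-offset IDs (assumed scheme): layer k owns [base + k*K, base + (k+1)*K)."""
    return [base_id + k * codebook_size + idx for k, idx in enumerate(indices)]


def from_token_ids(token_ids, codebook_size, base_id=0):
    """Invert to_token_ids to recover per-layer codebook indices."""
    return [t - base_id - k * codebook_size for k, t in enumerate(token_ids)]


# Toy example: two layers, two 2-D codes per layer (hypothetical values).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # coarse layer
    [[0.0, 0.0], [0.5, 0.5]],   # fine layer, quantizes the residual
]
latent = [1.4, 1.6]

indices = rvq_encode(latent, codebooks)                        # [1, 1]
token_ids = to_token_ids(indices, codebook_size=2, base_id=100)  # [101, 103]
recovered = rvq_decode(from_token_ids(token_ids, 2, base_id=100), codebooks)
print(indices, token_ids, recovered)
```

In this sketch a whole latent vector costs one token per residual layer, which is one way to see why such a tokenizer can shorten the token sequence the VLA has to generate.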

Abstract

In this paper, we introduce an innovative vector-quantization-based action tokenizer built upon the largest-scale action trajectory dataset to date, leveraging over 100 times more data than previous approaches. This extensive dataset enables our tokenizer to capture rich spatiotemporal dynamics, resulting in a model that not only accelerates inference but also generates smoother and more coherent action outputs. Once trained, the tokenizer can be seamlessly adapted to a wide range of downstream tasks in a zero-shot manner, from short-horizon reactive behaviors to long-horizon planning. A key finding of our work is that the domain gap between synthetic and real action trajectories is marginal, allowing us to effectively utilize a vast amount of synthetic data during training without compromising real-world performance. To validate our approach, we conducted extensive experiments both in simulated environments and on real robotic platforms. The results demonstrate that as the volume of synthetic trajectory data increases, the performance of our tokenizer on downstream tasks improves significantly, most notably achieving up to a 30% higher success rate on two real-world tasks in long-horizon scenarios. These findings highlight the potential of our action tokenizer as a robust and scalable solution for real-time embodied intelligence systems, paving the way for more efficient and reliable robotic control in diverse application domains.

Project website: https://xiaoxiao0406.github.io/vqvla.github.io

Paper Structure

This paper contains 23 sections, 3 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: The VQ-VLA pipeline, consisting of two main stages: (1) training a general convolutional residual VQ-VAE and (2) fine-tuning OpenVLA with LoRA. The VQ-VAE is first trained on the Open X-Embodiment, LIBERO, and ManiSkill datasets. It is then frozen and serves as the action tokenizer for OpenVLA, replacing the simple binning method. In the second stage, OpenVLA is fine-tuned with LoRA to optimize its performance.
  • Figure 2: All evaluation environments: We conduct comprehensive evaluations of VQ-VLA in both simulation and real-world settings. In simulation, evaluations are performed on the LIBERO-90 benchmark within the LIBERO dataset. For real-world testing, six diverse tasks are designed.
  • Figure 3: Real-world experimental results: We compare the performance of Baseline, VQO, VQO+L, and VQO+L+M on both short-horizon and long-horizon tasks. In terms of the average success rate, all VQ-based models outperform the Baseline. The best-performing model, VQO+L+M, achieves a success rate that is 23.25% higher than the Baseline on both short-horizon and long-horizon tasks. Additionally, the results show that VQO+L+M outperforms VQO+L, which in turn outperforms VQO, indicating the effectiveness of incorporating synthetic data during training without compromising real-world performance.
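For contrast with the VQ-VAE tokenizer in Figure 1, the "simple binning method" it replaces can be sketched as uniform per-dimension discretization. This is a minimal illustration under stated assumptions, not OpenVLA's code: the action range [-1, 1], the default of 256 bins, the 7-dimensional action, and the function names are all assumptions chosen for the example.

```python
def bin_tokenize(action, low=-1.0, high=1.0, n_bins=256):
    """Map each action dimension to a uniform bin index (one token per dimension)."""
    width = (high - low) / n_bins
    tokens = []
    for a in action:
        clipped = min(max(a, low), high)          # clamp out-of-range values
        idx = min(int((clipped - low) / width), n_bins - 1)
        tokens.append(idx)
    return tokens


def bin_detokenize(tokens, low=-1.0, high=1.0, n_bins=256):
    """Map each bin index back to the center of its bin."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


# Hypothetical 7-D action (e.g. end-effector deltas plus gripper).
action = [-1.0, -0.5, 0.0, 0.25, 0.5, 0.75, 1.0]
tokens = bin_tokenize(action)
restored = bin_detokenize(tokens)
print(len(tokens), tokens)
```

Under these assumptions, binning emits one token per action dimension per time step, so the token count grows linearly with the number of steps being predicted. A tokenizer that compresses a whole chunk of steps into a handful of codebook indices avoids that growth, which is consistent with the faster inference and longer action sequences described in the summary.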