FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization

Yicheng Liu; Shiduo Zhang; Zibin Dong; Baijun Ye; Tianyuan Yuan; Xiaopeng Yu; Linqi Yin; Chenhao Lu; Junhao Shi; Luca Jiang-Tao Yu; Liangtao Zheng; Tao Jiang; Jingjing Gong; Xipeng Qiu; Hang Zhao

TL;DR

FASTer tackles the bottleneck of action tokenization in autoregressive vision-language-action (VLA) models by introducing FASTerVQ, a high-compression, high-fidelity action tokenizer, and FASTerVLA, an efficient autoregressive policy that uses block-wise autoregressive (BAR) decoding and a lightweight action expert. The approach achieves near-lossless action reconstruction with significantly fewer tokens and generalizes well across embodiments and backbones. Empirical results show state-of-the-art in-distribution performance and robust out-of-distribution transfer, with substantially lower inference latency than prior autoregressive VLA methods. Overall, FASTer provides a scalable, transferable framework for efficient multimodal robotic control that can leverage pretrained VLM backbones without retraining.

Abstract

Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.
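
To make the idea of encoding an action chunk as a single-channel image more concrete, here is a minimal sketch. It is not the FASTerVQ implementation: the function names, the chunk size (16 steps by 8 action dimensions), the 4x4 patch split, and the randomly initialized codebook are all assumptions chosen for illustration, and plain nearest-neighbour vector quantization stands in for the learned neural tokenizer described in the abstract.

```python
import numpy as np

def chunk_to_image(actions):
    """Treat a (T, D) action chunk as a single-channel image of shape (1, T, D)."""
    actions = np.asarray(actions, dtype=np.float32)
    return actions[None, :, :]

def quantize_patches(image, codebook, patch):
    """Map each non-overlapping patch to the index of its nearest codebook vector.

    Hypothetical stand-in for a learned tokenizer: the codebook is supplied
    by the caller rather than trained.
    """
    _, T, D = image.shape
    ph, pw = patch
    assert T % ph == 0 and D % pw == 0, "patch size must tile the image"
    # Split into (T/ph) * (D/pw) patches, each flattened to ph * pw values.
    patches = (
        image[0]
        .reshape(T // ph, ph, D // pw, pw)
        .transpose(0, 2, 1, 3)
        .reshape(-1, ph * pw)
    )
    # Squared Euclidean distance from every patch to every codebook entry.
    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
chunk = rng.normal(size=(16, 8))         # assumed: 16 time steps, 8 action dims
codebook = rng.normal(size=(32, 4 * 4))  # assumed: 32 codes over 4x4 patches

image = chunk_to_image(chunk)
tokens = quantize_patches(image, codebook, patch=(4, 4))
```

Under these assumed sizes, the 128 scalar values in the chunk are represented by 8 discrete tokens, which illustrates the kind of compression the abstract refers to. Because each patch spans several time steps and several action dimensions, a token summarizes a small spatio-temporal region rather than a single value.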
