ActionCodec: What Makes for Good Action Tokenizers

Zibin Dong, Yicheng Liu, Shiduo Zhang, Baijun Ye, Yifu Yuan, Fei Ni, Jingjing Gong, Xipeng Qiu, Hang Zhao, Yinchuan Li, Jianye Hao

TL;DR

The paper systematically analyzes action tokenizers for Vision-Language-Action (VLA) models and derives four design principles grounded in information theory: maximize temporal token overlap, minimize vocabulary redundancy, maximize alignment (mutual information) with the multimodal inputs, and minimize residual grammar, that is, keep tokens as independent as possible. It then presents ActionCodec, a Perceiver-style VQ tokenizer augmented with embodiment-specific soft prompts and RVQ post-training, which achieves state-of-the-art results on LIBERO among VLA models without robotics pre-training (up to 97.4% success) and strong real-world performance. Through extensive ablations and integration with multiple VLA paradigms (PD, KI, BAR), the work shows that tokenizer design critically shapes training efficiency and generalization, often more than model scale does. The design principles and the released model offer concrete methodological guidance and practical benchmarks for advancing discrete action representations for scalable physical intelligence.
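To make the VQ and RVQ terminology concrete, here is a minimal sketch of generic residual vector quantization in pure Python. It is not the authors' implementation: the function names, the toy 2-D codebooks, and the example latent are all hypothetical, and the sketch omits the Perceiver-style encoder, the soft prompts, and any training. It only shows how a continuous vector becomes a short list of discrete tokens, with each stage quantizing the residual left by the previous one.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(x: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to x (squared L2 distance)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(x, codebook[k]))


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    tokens = []
    for codebook in codebooks:
        k = nearest_code(residual, codebook)
        tokens.append(k)
        # Subtract the chosen code so the next stage sees only what is left.
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return tokens


def rvq_decode(tokens: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Reconstruct the vector as the sum of the selected codes."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


# Hypothetical toy codebooks: a coarse stage and a finer correction stage.
coarse = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
fine = [(0.0, 0.0), (0.25, 0.0), (0.0, 0.25), (-0.25, 0.0), (0.0, -0.25)]

latent = (0.9, 0.3)  # stand-in for one encoder output vector
tokens = rvq_encode(latent, [coarse, fine])
recon = rvq_decode(tokens, [coarse, fine])
print(tokens, recon)
```

With a single stage the reconstruction would be the coarse code alone. The second stage refines it, which is the usual motivation for residual quantization: better fidelity at the cost of more tokens per vector.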

Abstract

Vision-Language-Action (VLA) models leveraging the native autoregressive paradigm of Vision-Language Models (VLMs) have demonstrated superior instruction-following and training efficiency. Central to this paradigm is action tokenization, yet its design has primarily focused on reconstruction fidelity, failing to address its direct impact on VLA optimization. Consequently, the fundamental question of what makes for good action tokenizers remains unanswered. In this paper, we bridge this gap by establishing design principles specifically from the perspective of VLA optimization. We identify a set of best practices based on information-theoretic insights, including maximized temporal token overlap, minimized vocabulary redundancy, enhanced multimodal mutual information, and token independence. Guided by these principles, we introduce ActionCodec, a high-performance action tokenizer that significantly enhances both training efficiency and VLA performance across diverse simulation and real-world benchmarks. Notably, on LIBERO, a SmolVLM2-2.2B fine-tuned with ActionCodec achieves a 95.5% success rate without any robotics pre-training. With advanced architectural enhancements, this reaches 97.4%, representing a new SOTA for VLA models without robotics pre-training. We believe our established design principles, alongside the released model, will provide a clear roadmap for the community to develop more effective action tokenizers.

Paper Structure (24 sections, 6 equations, 15 figures, 9 tables)

Figures (15)

  • Figure 1: ActionCodec provides a comprehensive analysis of the VQ action tokenizer design elements that directly impact VLA training and summarizes the best practices. When used for autoregressive fine-tuning of SmolVLM2-2.2B without additional architectural designs, ActionCodec achieves performance on the LIBERO benchmark that far exceeds that of other action tokenizers, particularly in terms of training efficiency.
  • Figure 2: Neural Network Architecture of ActionCodec. We employ a Perceiver-like transformer architecture due to its inherent flexibility, which facilitates the modeling of diverse token relations and supports the encoding of variable-length action sequences.
  • Figure 3: LIBERO-Goal results for different design choices. All VLA models are based on the SmolVLM2-256M backbone, following vocabulary expansion and full-parameter fine-tuning without additional architectural modifications. The suffix notations are defined as follows: acc (L1 reconstruction error), tok (token budget), cb (vocabulary size), OR (overlap rate), SA (self-attention), Causal (SA w/ causal mask), and SP (training w/ soft-prompt).
  • Figure 4: t-SNE visualization of the VLA's last hidden states for the action [BOS] token on four LIBERO-Goal tasks.
  • Figure 5: (Left) VLA attention maps for CLIP vs. TCL-trained tokenizers. (Right) t-SNE visualization of the structured latent space across 40 LIBERO tasks.
  • ...and 10 more figures
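
The Figure 3 caption lists an overlap rate (OR) among the ablated design choices, but this excerpt does not define it. As a rough illustration only, the sketch below assumes one plausible reading: the fraction of token positions that stay the same when two action chunks from the same trajectory, starting at consecutive timesteps, are tokenized. The function name and the token IDs are hypothetical, and the paper's actual metric may differ.

```python
from typing import Sequence


def overlap_rate(tokens_t: Sequence[int], tokens_next: Sequence[int]) -> float:
    """Fraction of positions where two equal-length token sequences agree.

    Assumed definition for illustration; not taken from the paper.
    """
    if len(tokens_t) != len(tokens_next):
        raise ValueError("token sequences must have the same length")
    if not tokens_t:
        raise ValueError("token sequences must be non-empty")
    matches = sum(a == b for a, b in zip(tokens_t, tokens_next))
    return matches / len(tokens_t)


# Hypothetical token IDs for action chunks starting at steps t and t+1.
chunk_t = [17, 4, 4, 92, 8, 51, 3, 60]
chunk_t1 = [17, 4, 9, 92, 8, 51, 7, 60]

rate = overlap_rate(chunk_t, chunk_t1)
print(f"overlap rate: {rate:.2f}")
```

Under this reading, a higher value means consecutive prediction targets change less from step to step, which is the intuition behind the "maximized temporal token overlap" principle stated in the abstract.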