OAT: Ordered Action Tokenization

Chaoqi Liu, Xiaoshen Han, Jiawei Gao, Yue Zhao, Haonan Chen, Yilun Du

TL;DR

The paper tackles how to discretize continuous robot actions for autoregressive policies by formalizing three core desiderata: high compression, total decodability, and left-to-right causal ordering. It introduces Ordered Action Tokenization (OAT), a tokenizer that uses a transformer with register tokens, finite scalar quantization, and nested dropout to create an ordered, prefix-decodable token space that aligns with next-token prediction. Empirically, OAT outperforms naive binning, FAST, and learned latent tokenizers across 20+ simulation and real-world tasks, and it offers anytime decoding that trades computation for action fidelity. The work demonstrates that token-space ordering is a crucial inductive bias for stable, scalable autoregressive learning and suggests OAT as a versatile component for future robot learning pipelines and vision-language-action models (VLAs).
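To make the role of nested dropout more concrete, the following is a minimal Python sketch of the general idea: during training, sample a prefix length and hide every token after it, so that any prefix must remain decodable. The uniform sampling distribution, the function names, the `None` mask placeholder, and the token ids are all assumptions made for illustration; the summary above does not specify these details.

```python
import random


def sample_prefix_length(num_tokens, rng):
    """Draw how many leading tokens to keep.

    Uniform over 1..num_tokens is an assumption of this sketch.
    """
    return rng.randint(1, num_tokens)


def keep_prefix(tokens, keep, mask_value=None):
    """Keep the first `keep` tokens and replace the rest with a placeholder."""
    if not 1 <= keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")
    return list(tokens[:keep]) + [mask_value] * (len(tokens) - keep)


rng = random.Random(0)
tokens = [12, 7, 31, 4, 19, 2, 25, 8]  # hypothetical token ids
masked = keep_prefix(tokens, sample_prefix_length(len(tokens), rng))
```

Because the first token is never masked while later tokens are hidden more often, a decoder trained on such inputs is pushed to place the most important information in the earliest positions, which is one way to read the "ordered, prefix-decodable" property.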

Abstract

Autoregressive policies offer a compelling foundation for scalable robot learning by enabling discrete abstraction, token-level reasoning, and flexible inference. However, applying autoregressive modeling to continuous robot actions requires an effective action tokenization scheme. Existing approaches either rely on analytical discretization methods that produce prohibitively long token sequences, or learned latent tokenizers that lack structure, limiting their compatibility with next-token prediction. In this work, we identify three desiderata for action tokenization - high compression, total decodability, and a left-to-right causally ordered token space - and introduce Ordered Action Tokenization (OAT), a learned action tokenizer that satisfies all three. OAT discretizes action chunks into an ordered sequence of tokens using a transformer with registers, finite scalar quantization, and ordering-inducing training mechanisms. The resulting token space aligns naturally with autoregressive generation and enables prefix-based detokenization, yielding an anytime trade-off between inference cost and action fidelity. Across more than 20 tasks spanning four simulation benchmarks and real-world settings, autoregressive policies equipped with OAT consistently outperform prior tokenization schemes and diffusion-based baselines, while offering significantly greater flexibility at inference time.
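Finite scalar quantization, mentioned in the abstract, can be illustrated with a small Python sketch: each latent dimension is bounded with tanh, rounded to one of a few levels, and the per-dimension levels are combined into a single token id. The level counts `[5, 5, 5]`, the restriction to odd level counts, and the function names are assumptions of this sketch, not values taken from the paper; the straight-through gradient used when training such a quantizer is omitted.

```python
import math


def _check_levels(levels):
    # This sketch supports odd level counts >= 3 only (a simplifying assumption).
    for num_levels in levels:
        if num_levels < 3 or num_levels % 2 == 0:
            raise ValueError("level counts must be odd and at least 3")


def fsq_quantize(z, levels):
    """Quantize a latent vector; return (per-dimension codes, token id)."""
    _check_levels(levels)
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    codes = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) // 2
        # tanh bounds the value to (-1, 1); scaling and rounding snaps it
        # to an integer in [-half, half], shifted to [0, num_levels - 1].
        codes.append(round(math.tanh(value) * half) + half)
    token_id = 0
    for code, num_levels in zip(codes, levels):
        token_id = token_id * num_levels + code  # mixed-radix index
    return codes, token_id


def fsq_dequantize(token_id, levels):
    """Map a token id back to quantized latent values in [-1, 1]."""
    _check_levels(levels)
    codes = []
    for num_levels in reversed(levels):
        codes.append(token_id % num_levels)
        token_id //= num_levels
    codes.reverse()
    return [(c - (n - 1) // 2) / ((n - 1) // 2) for c, n in zip(codes, levels)]


codes, token_id = fsq_quantize([10.0, -10.0, 0.0], [5, 5, 5])
```

With three dimensions of five levels each, the implicit codebook has 5 × 5 × 5 = 125 entries, and no codebook vectors need to be learned or stored.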

Paper Structure (34 sections, 8 equations, 6 figures, 6 tables, 2 algorithms)

Figures (6)

  • Figure 1: Left: Comparison of action tokenization schemes with respect to three desiderata: high compression (P.1), total decodability (P.2), and left-to-right causally ordered token structure (P.3). Existing methods satisfy only subsets of these properties, while OAT uniquely satisfies all three. Middle: Behavior of different policy classes as inference progresses. Diffusion and flow policies refine actions through iterative sampling, while autoregressive policies generate discrete tokens step-by-step. Due to its ordered token space, OAT enables prefix-based detokenization: early tokens produce coarse action chunks, and additional autoregressive steps progressively refine actions, enabling flexible, anytime action generation. Right: Overall policy performance aggregated over 20+ tasks.
  • Figure 2: Coarse-to-fine action chunk reconstruction. Visualization of reconstructed action chunks using increasing numbers of decoded tokens. Panels (a–d) show OAT reconstructions using $K \in \{1,2,4,8\}$ tokens, respectively, while (e) shows the ground-truth action chunk. Earlier tokens capture the coarse, global structure of the motion, while additional tokens progressively refine fine-grained details, yielding trajectories that increasingly match the ground truth. Ghosted poses indicate temporal progression within each reconstructed action chunk. Interactive visualization on project website: https://ordered-action-tokenization.github.io/.
  • Figure 3: OAT overview. Left: OAT maps a chunk of continuous actions into an ordered sequence of discrete tokens using a transformer encoder with register tokens, FSQ, and nested dropout to induce token ordering. The resulting tokens form a compact action representation, which is decoded to reconstruct action chunks for downstream autoregressive policies. Right: During OAT policy inference, tokens are generated autoregressively and can be detokenized from any prefix. As more autoregressive steps are taken, additional tokens progressively refine the decoded action chunk, producing actions with increasing temporal and spatial detail. OAT enables flexible, anytime action generation.
  • Figure 4: Simulation setups. We evaluate OAT across four widely used robotic manipulation benchmarks spanning diverse task structures and dynamics. These environments cover a range of skills, including object manipulation, tool use, and multi-stage interactions.
  • Figure 5: Effect of action and token horizons. Performance of OAT$_{H_l}$ on LIBERO as a function of action horizon $H_a$ (rows) and token horizon $H_l$ (columns). Results report mean success rates with standard error across 5 seeds and 50 evaluation rollouts per seed per task.
  • ...and 1 more figure
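The Figure 5 caption reports mean success rates with standard error across 5 seeds. One plausible reading, sketched below in Python, is to compute a success rate per seed and then take the mean and the standard error of the mean over seeds. Whether the paper aggregates exactly this way is an assumption, and the per-seed rates shown are made-up placeholders, not results from the paper.

```python
import math
import statistics


def success_rate(outcomes):
    """Fraction of successful rollouts, given a list of booleans or 0/1 values."""
    if not outcomes:
        raise ValueError("need at least one rollout")
    return sum(outcomes) / len(outcomes)


def mean_and_standard_error(per_seed_rates):
    """Mean over seeds and standard error of the mean (sample std / sqrt(n))."""
    n = len(per_seed_rates)
    if n < 2:
        raise ValueError("need at least two seeds for a standard error")
    mean = statistics.fmean(per_seed_rates)
    standard_error = statistics.stdev(per_seed_rates) / math.sqrt(n)
    return mean, standard_error


# Placeholder per-seed success rates (hypothetical values for illustration).
placeholder_rates = [0.4, 0.6]
mean, standard_error = mean_and_standard_error(placeholder_rates)
```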