BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning

Hongyi Zhou, Weiran Liao, Xi Huang, Yucheng Tang, Fabian Otto, Xiaogang Jia, Xinkai Jiang, Simon Hilber, Ge Li, Qian Wang, Ömer Erdinç Yağmurlu, Nils Blank, Moritz Reuss, Rudolf Lioutikov

TL;DR

BEAST tackles the inefficiency of generating high-frequency continuous robot actions by introducing a B-spline encoded action sequence tokenizer that yields fixed-length tokens without requiring tokenizer training. By representing trajectories with B-spline control points and enabling parallel decoding, BEAST achieves fast inference while ensuring smooth transitions between action chunks. The approach is demonstrated across discrete and continuous token variants (BEAST-F, BEAST-D, BEAST-ACT) and multiple architectures, with strong performance on simulation benchmarks and real-world robots, and notable gains in training efficiency. This tokenizer offers a scalable, plug-and-play primitive for imitation learning in robotics, potentially enabling more responsive and robust autonomous manipulation without extensive tokenizer-induced training complexity.

Abstract

We present the B-spline Encoded Action Sequence Tokenizer (BEAST), a novel action tokenizer that encodes action sequences into compact discrete or continuous tokens using B-splines. In contrast to existing action tokenizers based on vector quantization or byte pair encoding, BEAST requires no separate tokenizer training and consistently produces tokens of uniform length, enabling fast action sequence generation via parallel decoding. Owing to its B-spline formulation, BEAST inherently generates smooth trajectories without discontinuities between adjacent segments. We extensively evaluate BEAST by integrating it with three distinct model architectures: a Variational Autoencoder (VAE) with continuous tokens, a decoder-only Transformer with discrete tokens, and Florence-2, a pretrained Vision-Language Model with an encoder-decoder architecture, demonstrating BEAST's compatibility and scalability with large pretrained models. We evaluate BEAST across three established benchmarks consisting of 166 simulated tasks and on three distinct robot settings with a total of 8 real-world tasks. Experimental results demonstrate that BEAST (i) significantly reduces both training and inference computational costs, (ii) consistently generates smooth, high-frequency control signals suitable for continuous control tasks, and (iii) reliably achieves competitive task success rates compared to state-of-the-art methods.

Paper Structure

This paper contains 25 sections, 3 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: From left to right: clamped B-spline bases of degree $P=0, 1, 2, 3, 4$ (top) and the trajectories they generate (bottom). Given the same control points, a higher degree leads to smoother trajectories. All generated trajectories start exactly at the first control point and end exactly at the last control point. Notably, action chunking is conceptually equivalent to B-splines of degree 0, i.e., piecewise constants, as shown in the leftmost subplots. This relation is explained in detail later in the paper.
  • Figure 2: Overview of the BEAST Encoding Pipeline: Given a normalized action sequence, the BEAST pipeline first uses linear regression to extract continuous-valued control points, forming control point matrices that serve as intermediate continuous representations. These matrices are then quantized uniformly into discrete values within the range $[0, 255]$ and subsequently flattened to produce discrete action tokens for auto-regressive next-token prediction or parallel prediction.
  • Figure 3: BEAST-F is a new VLA model that combines BEAST encoding with Florence-2 (xiao2024florence), a lightweight VLM with 0.77B parameters. BEAST produces uniform-length tokens, which allows BEAST-F to perform parallel decoding via learnable action embeddings (AE), instead of autoregressive next-token prediction. These discrete tokens are fed into the B-Spline Decoder, which first maps them to real-valued control points and then transforms those control points into continuous action sequences. The Pr token denotes an optional proprioceptive state.
  • Figure 4: Simulation (zhao2023learning, mees2022calvin, liu2024libero) and real-world (Franka Challenge, Aloha, Franka Kitchen) tasks.
  • Figure 5: Comparison among BEAST, single-step binning tokenization, and binning tokenization with action chunking (AC). All three tokenizers are paired with the same autoregressive model and fit the same ground-truth cubic splines given the same context points. BEAST is smooth within each sequence and continuous at the transitions between sequences.
  • ...and 7 more figures
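
The captions of Figures 1 and 2 describe a clamped B-spline basis and an encoding pipeline: regress control points from a normalized action sequence, quantize them uniformly into $[0, 255]$, and flatten them into tokens. The following is a minimal NumPy sketch of that idea, not the authors' implementation. The function names (`clamped_basis`, `encode`, `decode`), the uniform knot spacing, the defaults of 10 basis functions and degree 3, the assumed control-point range $[-1, 1]$ with clipping, and the flattening order are all illustrative assumptions.

```python
import numpy as np

def clamped_basis(num_basis, degree, num_steps):
    """Clamped uniform B-spline basis on [0, 1], shape (num_steps, num_basis).

    Assumes num_basis > degree and num_steps >= 2.
    """
    # Clamped knot vector: each end knot appears degree + 1 times.
    inner = np.linspace(0.0, 1.0, num_basis - degree + 1)
    knots = np.concatenate([np.zeros(degree), inner, np.ones(degree)])
    t = np.linspace(0.0, 1.0, num_steps)
    # Degree 0: indicator function of every non-empty knot span.
    B = np.zeros((num_steps, len(knots) - 1))
    for i in range(len(knots) - 1):
        if knots[i] < knots[i + 1]:
            B[:, i] = (t >= knots[i]) & (t < knots[i + 1])
    B[-1, num_basis - 1] = 1.0  # close the last non-empty span at t = 1
    # Cox-de Boor recursion raises the degree one step at a time.
    for p in range(1, degree + 1):
        nxt = np.zeros((num_steps, B.shape[1] - 1))
        for i in range(nxt.shape[1]):
            left = knots[i + p] - knots[i]
            right = knots[i + p + 1] - knots[i + 1]
            if left > 0:
                nxt[:, i] += (t - knots[i]) / left * B[:, i]
            if right > 0:
                nxt[:, i] += (knots[i + p + 1] - t) / right * B[:, i + 1]
        B = nxt
    return B

def encode(actions, num_basis=10, degree=3, lo=-1.0, hi=1.0):
    """Map a (num_steps, dof) action sequence to num_basis * dof integer tokens."""
    B = clamped_basis(num_basis, degree, actions.shape[0])
    # Linear regression (least squares) for the control points, (num_basis, dof).
    ctrl, *_ = np.linalg.lstsq(B, actions, rcond=None)
    # Uniform quantization into 256 bins; the [lo, hi] range is an assumption.
    bins = np.round((np.clip(ctrl, lo, hi) - lo) / (hi - lo) * 255)
    return bins.astype(np.int64).T.reshape(-1)  # flatten, one DoF after another

def decode(tokens, num_steps, dof, num_basis=10, degree=3, lo=-1.0, hi=1.0):
    """Map tokens back to control points, then to a (num_steps, dof) trajectory."""
    ctrl = tokens.reshape(dof, num_basis).T / 255.0 * (hi - lo) + lo
    return clamped_basis(num_basis, degree, num_steps) @ ctrl

# Token count depends only on num_basis and dof, not on the sequence length.
t = np.linspace(0.0, 1.0, 50)[:, None]
demo_actions = 0.7 * np.sin(2 * np.pi * t + np.arange(7))  # 50 steps, 7 DoF
demo_tokens = encode(demo_actions)
demo_recon = decode(demo_tokens, num_steps=50, dof=7)
```

In this sketch a sequence of any length maps to `num_basis * dof` tokens, which is what makes fixed-length parallel prediction possible. Calling `clamped_basis` with `degree=0` yields one-hot rows, i.e. a piecewise-constant trajectory, matching the Figure 1 remark that action chunking corresponds to degree-0 B-splines.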