Action Tokenizer Matters in In-Context Imitation Learning

An Dinh Vuong, Minh Nhat Vu, Dong An, Ian Reid

TL;DR

The paper addresses generalization in in-context imitation learning (ICIL) by focusing on action representation and temporal smoothness. It proposes LipVQ-VAE, a Lipschitz-regularized vector-quantized action tokenizer integrated with the ICRT framework to produce smooth, discrete latent action codes that improve robustness. LipVQ-VAE yields notable gains, including a $>5.3\%$ improvement in high-fidelity simulators and smoother real-world trajectories, with sim-to-real transfer demonstrated on a Kinova arm. This work underscores the importance of a smooth latent action space for reliable ICIL and offers a concrete tokenizer design to advance practical robotic manipulation.

Abstract

In-context imitation learning (ICIL) is a new paradigm that enables robots to generalize from demonstrations to unseen tasks without retraining. A well-structured action representation is the key to capturing demonstration information effectively, yet action tokenization (the process of discretizing and encoding actions) remains largely unexplored in ICIL. In this work, we first systematically evaluate existing action tokenizers in ICIL and reveal a critical limitation: while they effectively encode action trajectories, they fail to preserve temporal smoothness, which is crucial for stable robotic execution. To address this, we propose LipVQ-VAE, a variational autoencoder that enforces the Lipschitz condition in the latent action space via weight normalization. By propagating smoothness constraints from raw action inputs to a quantized latent codebook, LipVQ-VAE generates more stable and smoother actions. When integrated into ICIL, LipVQ-VAE improves performance by more than 5.3% in high-fidelity simulators, with real-world experiments confirming its ability to produce smoother, more reliable trajectories. Code and checkpoints are available at https://action-tokenizer-matters.github.io/
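
To make the idea of enforcing a Lipschitz condition through weight normalization concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the fixed `bound` parameter, the `tanh` activation, and the 7-dimensional action are all assumptions chosen for illustration. Each row of the weight matrix is rescaled so that its absolute row sum does not exceed `bound`, which caps the layer's Lipschitz constant in the infinity norm. Two nearby actions therefore cannot be mapped to latents that are farther apart than `bound` times their input distance.

```python
import numpy as np


def normalize_rows(W, bound):
    """Rescale each row of W so that its absolute row sum is at most `bound`."""
    row_sums = np.abs(W).sum(axis=1, keepdims=True)
    # Rows already within the bound are left untouched (scale = 1).
    scale = np.minimum(1.0, bound / np.maximum(row_sums, 1e-12))
    return W * scale


def lipschitz_layer(x, W, b, bound=1.0):
    """Linear layer followed by tanh; infinity-norm Lipschitz constant <= bound."""
    # tanh is 1-Lipschitz, so the bound on the linear map carries over.
    return np.tanh(normalize_rows(W, bound) @ x + b)


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 7)) * 3.0   # hypothetical encoder weights (7-D action in, 8-D latent out)
b = rng.normal(size=8)

a_t = rng.normal(size=7)                      # action at step t
a_next = a_t + 0.01 * rng.normal(size=7)      # slightly perturbed next action

input_gap = np.max(np.abs(a_next - a_t))
latent_gap = np.max(np.abs(lipschitz_layer(a_next, W, b) - lipschitz_layer(a_t, W, b)))

# With bound=1.0 the latent gap can never exceed the input gap.
print(latent_gap <= input_gap)
```

In this sketch the bound is a fixed constant. A learned per-layer bound is a common variant, but whether and how the paper parameterizes it is not stated in the text above.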

Paper Structure

This paper contains 11 sections, 5 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: We examine the impact of action tokenizer on in-context imitation learning. Our findings indicate that a smoother representation of action correlates with higher robotic manipulation success. Note in the figure: a lower smoothness score reflects a smoother action representation.
  • Figure 2: ICRT architecture. ICRT models action prediction as next-token generation, utilizing prompt demonstrations for in-context learning.
  • Figure 3: LipVQ-VAE action tokenizer overview. We adopt an autoencoder framework that maps actions to a latent space via codebook lookup. To ensure smooth latent representation, we apply Lipschitz regularization by row-wise normalizing the weight matrix after each encoder layer.
  • Figure 4: Latent trajectories visualization. Using t-SNE, we visualize latent representations of different action tokenizers given the same action trajectory.
  • Figure 5: Impact of LipVQ-VAE's codebook size in RoboCasa.
  • ...and 1 more figure
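
The codebook lookup mentioned in the Figure 3 caption can be sketched as a nearest-neighbour search, as in a standard vector-quantized autoencoder. This is an illustrative assumption rather than the paper's code: the `quantize` helper, the tiny 2-D codebook, and the example latents are hypothetical.

```python
import numpy as np


def quantize(z, codebook):
    """Map each latent row in z to the index and value of its nearest codebook entry."""
    # Squared Euclidean distance between every latent and every code: shape (N, K).
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]


# Hypothetical codebook with 4 entries in a 2-D latent space.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Hypothetical encoder outputs for three consecutive actions.
z = np.array([[0.1, -0.2], [0.9, 0.2], [0.4, 0.8]])

idx, z_q = quantize(z, codebook)
print(idx)  # [0 1 2]
```

The indices `idx` play the role of discrete action tokens, and `z_q` is the quantized latent that a decoder would consume. When the encoder is smooth, nearby actions land near each other before quantization, which is the property the Lipschitz regularization is meant to encourage.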