Action Tokenizer Matters in In-Context Imitation Learning
An Dinh Vuong, Minh Nhat Vu, Dong An, Ian Reid
TL;DR
The paper addresses generalization in in-context imitation learning (ICIL) by focusing on action representation and temporal smoothness. It proposes LipVQ-VAE, a Lipschitz-regularized vector-quantized action tokenizer that is integrated with the ICRT framework to produce smooth, discrete latent action codes and improve robustness. LipVQ-VAE improves performance by more than 5.3% in high-fidelity simulators and yields smoother real-world trajectories, with sim-to-real transfer demonstrated on a Kinova arm. The work underscores the importance of a smooth latent action space for reliable ICIL and offers a concrete tokenizer design for practical robotic manipulation.
Abstract
In-context imitation learning (ICIL) is a new paradigm that enables robots to generalize from demonstrations to unseen tasks without retraining. A well-structured action representation is key to capturing demonstration information effectively, yet action tokenization (the process of discretizing and encoding actions) remains largely unexplored in ICIL. In this work, we first systematically evaluate existing action tokenizers in ICIL and reveal a critical limitation: while they effectively encode action trajectories, they fail to preserve temporal smoothness, which is crucial for stable robotic execution. To address this, we propose LipVQ-VAE, a variational autoencoder that enforces the Lipschitz condition in the latent action space via weight normalization. By propagating smoothness constraints from raw action inputs to a quantized latent codebook, LipVQ-VAE generates more stable and smoother actions. When integrated into ICIL, LipVQ-VAE improves performance by more than 5.3% in high-fidelity simulators, and real-world experiments confirm its ability to produce smoother, more reliable trajectories. Code and checkpoints are available at https://action-tokenizer-matters.github.io/
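The idea of bounding the Lipschitz constant through weight normalization and then quantizing against a codebook can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single linear encoder, the row-sum rescaling rule, the bound `c`, the dimensions, and all function names are assumptions chosen for illustration only. The sketch shows that when the infinity norm of the encoder weight is capped at `c`, the step between consecutive latent vectors is at most `c` times the step between consecutive actions.

```python
import numpy as np

def lipschitz_normalize(W, c):
    """Rescale each row of W so that its absolute row sum is at most c.

    The infinity norm of a matrix is its largest absolute row sum, so the
    returned matrix has infinity norm <= c (hypothetical normalization rule).
    """
    row_sums = np.abs(W).sum(axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(row_sums, 1e-12))
    return W * scale

def encode(actions, W, b, c):
    """Linear encoder whose Lipschitz constant (infinity norm) is at most c."""
    W_hat = lipschitz_normalize(W, c)
    return actions @ W_hat.T + b

def quantize(z, codebook):
    """Map each latent vector to the index and value of its nearest code."""
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]

# Illustrative sizes only; none of these values come from the paper.
rng = np.random.default_rng(0)
action_dim, latent_dim, num_codes = 7, 4, 16
W = rng.normal(size=(latent_dim, action_dim))
b = np.zeros(latent_dim)
codebook = rng.normal(size=(num_codes, latent_dim))
c = 1.0

# A smooth synthetic action trajectory of 50 timesteps.
t = np.linspace(0.0, 1.0, 50)[:, None]
actions = np.sin(2.0 * np.pi * t * np.arange(1, action_dim + 1))

z = encode(actions, W, b, c)
tokens, z_q = quantize(z, codebook)

# Per-step change in action space and in the pre-quantization latent space.
action_steps = np.abs(np.diff(actions, axis=0)).max(axis=1)
latent_steps = np.abs(np.diff(z, axis=0)).max(axis=1)
print(bool(np.all(latent_steps <= c * action_steps + 1e-9)))
```

The bound applies to the continuous latents before quantization. The nearest-code lookup is piecewise constant and therefore not Lipschitz on its own, so the smoothness of the final tokens also depends on how the codebook is learned.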
