Quantization-Free Autoregressive Action Transformer
Ziyad Sheebaelhamd, Michael Tschannen, Michael Muehlebach, Claire Vernade
TL;DR
Quantization-Free Autoregressive Action Transformer (Q-FAT) replaces action discretization with a continuous Generative Infinite-Vocabulary Transformer (GIVT) that models the action distribution as a Gaussian Mixture Model, enabling end-to-end autoregressive policy learning with explicit likelihoods. The method incorporates goal conditioning and two sampling strategies, variance down-scaling and mean-shift mode sampling, to stabilize multimodal trajectories, and it achieves state-of-the-art results across multiple simulated robotics tasks while maintaining competitive inference speed. Experiments compare Q-FAT against discretization-based baselines and diffusion policies, showing a better balance between action quality and diversity and offering practical insights into the number of mixture components, initialization, and sampling effects. The work discusses a limitation (the assumption of a Euclidean action space) and outlines future directions: extending to non-Euclidean spaces, incorporating priors, and exploring coarse-to-fine strategies. It emphasizes safety and reproducibility through open-source code.
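To make the two sampling strategies concrete, here is a minimal NumPy sketch of how they could look for a single Gaussian mixture head. It is an illustration under stated assumptions, not the authors' implementation: the diagonal covariances, the function names `sample_scaled` and `mean_shift_mode`, and the parameters `var_scale` and `n_iters` are all hypothetical, and the mean-shift step uses the standard fixed-point iteration for modes of a Gaussian mixture.

```python
import numpy as np


def sample_scaled(rng, weights, means, scales, var_scale=1.0):
    """Sample from a diagonal GMM after multiplying every variance by var_scale.

    weights: [K] mixture weights (sum to 1)
    means:   [K, d] component means
    scales:  [K, d] component standard deviations
    var_scale < 1 concentrates samples around the component means (assumed
    reading of "variance down-scaling"); var_scale = 0 returns a mean exactly.
    """
    k = rng.choice(len(weights), p=weights)
    noise = rng.standard_normal(means.shape[1])
    return means[k] + np.sqrt(var_scale) * scales[k] * noise


def mean_shift_mode(x, weights, means, scales, n_iters=50):
    """Move x to a nearby mode of the GMM density by fixed-point iteration."""
    for _ in range(n_iters):
        # Responsibilities r_k(x) of each component, computed in log space.
        z = (x - means) / scales
        log_r = np.log(weights) - 0.5 * np.sum(z**2 + 2.0 * np.log(scales), axis=-1)
        r = np.exp(log_r - log_r.max())
        r /= r.sum()
        # Precision-weighted average of the means (stationarity condition of the density).
        precision = r[:, None] / scales**2
        x = np.sum(precision * means, axis=0) / np.sum(precision, axis=0)
    return x
```

A possible usage pattern is to draw a sample with `sample_scaled` and then refine it with `mean_shift_mode`, so the final action sits on a local mode of the mixture rather than in a low-density region between modes.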
Abstract
Current transformer-based imitation learning approaches introduce discrete action representations and train an autoregressive transformer decoder on the resulting latent code. However, the initial quantization breaks the continuous structure of the action space, thereby limiting the capabilities of the generative model. We instead propose a quantization-free method that leverages Generative Infinite-Vocabulary Transformers (GIVT) as a direct, continuous policy parametrization for autoregressive transformers. This simplifies the imitation learning pipeline while achieving state-of-the-art performance on a variety of popular simulated robotics tasks. We further improve the results by carefully studying sampling algorithms for policy roll-outs.
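As an illustration of what a continuous policy parametrization with an explicit likelihood can look like, the sketch below evaluates the log-density of a continuous action under a Gaussian mixture whose parameters would be emitted by the transformer at each step. The diagonal covariance, the function name `gmm_log_likelihood`, and the argument layout are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def gmm_log_likelihood(action, logits, means, scales):
    """Log-density of a continuous action under a K-component diagonal GMM.

    action: [d] continuous action vector
    logits: [K] unnormalized mixture logits
    means:  [K, d] component means
    scales: [K, d] component standard deviations (positive)
    """
    # Log-softmax turns the logits into normalized log mixture weights.
    log_weights = logits - np.logaddexp.reduce(logits)
    # Diagonal Gaussian log-density of each component, summed over dimensions.
    z = (action - means) / scales
    log_comp = -0.5 * np.sum(z**2 + 2.0 * np.log(scales) + np.log(2.0 * np.pi), axis=-1)
    # Log-sum-exp over components gives the mixture log-density.
    return np.logaddexp.reduce(log_weights + log_comp)
```

Under these assumptions, a training loss would be the negative of this quantity averaged over the demonstrated actions, which requires no quantization step or codebook.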
