Quantization-Free Autoregressive Action Transformer

Ziyad Sheebaelhamd, Michael Tschannen, Michael Muehlebach, Claire Vernade

TL;DR

Quantization-Free Autoregressive Action Transformer (Q-FAT) replaces action discretization with a continuous Generative Infinite-Vocabulary Transformer (GIVT) that parameterizes the action distribution as a Gaussian Mixture Model (GMM), enabling end-to-end autoregressive policy learning with explicit likelihoods. The method incorporates goal conditioning and two sampling strategies, variance down-scaling and mean-shift mode sampling, to stabilize multimodal trajectories, achieving state-of-the-art results across multiple simulated robotics tasks while maintaining competitive inference speed. Experiments compare Q-FAT against discretization-based baselines and diffusion policies, showing a better balance between action quality and diversity and yielding practical insights into the number of mixture components, initialization, and sampling effects. The work discusses its main limitation (the assumption of a Euclidean action space) and outlines future directions: extending to non-Euclidean spaces, incorporating priors, and exploring coarse-to-fine strategies, with an emphasis on safety and reproducibility through open-source code.

Abstract

Current transformer-based imitation learning approaches introduce discrete action representations and train an autoregressive transformer decoder on the resulting latent code. However, the initial quantization breaks the continuous structure of the action space, thereby limiting the capabilities of the generative model. Instead, we propose a quantization-free method that leverages Generative Infinite-Vocabulary Transformers (GIVT) as a direct, continuous policy parametrization for autoregressive transformers. This simplifies the imitation learning pipeline while achieving state-of-the-art performance on a variety of popular simulated robotics tasks. We further improve the results by carefully studying the sampling algorithms used in policy roll-outs.
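
As a rough illustration of what a continuous GMM policy parametrization with explicit likelihoods involves, the following NumPy sketch computes the negative log-likelihood of one continuous action under a diagonal-covariance mixture and draws a sample from it. The function names, the diagonal covariance, and the log-standard-deviation parameterization are assumptions made for this example; they are not taken from the paper or its code.

```python
import numpy as np


def gmm_nll(action, logits, means, log_stds):
    """Negative log-likelihood of one action under a diagonal GMM.

    Hypothetical shapes (assumed for this sketch):
      action:   (d,)    continuous action
      logits:   (k,)    unnormalized mixture weights
      means:    (k, d)  component means
      log_stds: (k, d)  per-dimension log standard deviations
    """
    # Log-softmax over the mixture logits.
    log_pi = logits - np.logaddexp.reduce(logits)
    var = np.exp(2.0 * log_stds)
    # Log-density of each diagonal Gaussian component at `action`.
    log_comp = -0.5 * np.sum(
        (action - means) ** 2 / var + 2.0 * log_stds + np.log(2.0 * np.pi),
        axis=1,
    )
    # Log-sum-exp over components gives the mixture log-likelihood.
    return -np.logaddexp.reduce(log_pi + log_comp)


def gmm_sample(rng, logits, means, log_stds):
    """Draw one action: pick a component, then sample its Gaussian."""
    pi = np.exp(logits - np.logaddexp.reduce(logits))
    j = rng.choice(len(pi), p=pi)
    noise = rng.standard_normal(means.shape[1])
    return means[j] + np.exp(log_stds[j]) * noise
```

In a training loop, a loss of this form would be averaged over the actions in a demonstration batch, with `logits`, `means`, and `log_stds` produced by the transformer's output head at each step.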

Paper Structure

This paper contains 35 sections, 15 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Panel (a) shows an overview of Q-FAT. A sequence of $h^{s}$ previous states and $h^{g}$ goal states is projected into the Transformer's embedding dimension using a linear layer. The Transformer then predicts the action distribution by predicting the GMM means, variances and mixture probabilities. Panel (b) shows a sample of generated trajectories from 3 representative environments, demonstrating the captured multi-modality in the order in which tasks are solved (Kitchen and UR3 Block Push) or the direction from which an object is approached (PushT).
  • Figure 2: The panels illustrate the potential detrimental effects of down-scaling the component variances in a learned Gaussian Mixture Model (GMM). The left panel shows that, when the number of components ($k$) is misspecified (e.g., approximating a unimodal distribution with two components), down-scaling the variances leaves an irreducible variance that grows with the distance between component means. The two panels on the right show how reducing variance can lead to a loss of multimodality, specifically causing the central mode to disappear in a 2D two-component Gaussian mixture that exhibits three modes.
  • Figure 3: Overview of Q-FAT's behavioural entropy on unconditional behavior generation compared to baselines.
  • Figure 4: Sampling techniques from an 8-mixture Q-FAT policy on a multiroute environment [bet]. (a) Raw dataset with two pairs of equally likely paths (blue and red) from start to target (green). (b) Direct GMM sampling yields noisy samples due to the captured dataset noise. (c) Variance scaling ($10^{-8}$) reduces per-component variance but not the variance from inter-component distances. (d) Mode sampling largely suppresses noise while preserving dataset multimodality.
  • Figure 5: We visualize 400 trajectories generated from Q-FAT with 16 mixtures on the PushT environment with different sampling methods. One can see that mode sampling reduces the variance and does not produce trajectory artifacts.
  • ...and 1 more figure
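
The effect described in the captions of Figures 2 and 4 can be checked numerically on a toy 1D mixture. The sketch below is a minimal example under stated assumptions: `mixture_variance` applies the law of total variance to show that shrinking the component variances leaves the spread between component means untouched, and `find_mode` runs a standard fixed-point (mean-shift style) iteration that climbs to a local mode of the mixture density. Both function names and the toy parameters are hypothetical, and the mode search is a generic technique, not necessarily the mode sampling procedure used in the paper.

```python
import numpy as np


def mixture_variance(pi, mu, sigma, scale=1.0):
    """Variance of a 1D GMM after multiplying component variances by `scale`."""
    within = np.sum(pi * scale * sigma**2)   # shrinks with `scale`
    mean = np.sum(pi * mu)
    between = np.sum(pi * (mu - mean) ** 2)  # unaffected by `scale`
    return within + between


def find_mode(x0, pi, mu, sigma, iters=100):
    """Fixed-point iteration towards a local mode of a 1D GMM density."""
    x = float(x0)
    for _ in range(iters):
        # Unnormalized log-responsibility of each component at x.
        log_w = np.log(pi) - np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2
        w = np.exp(log_w - log_w.max()) / sigma**2
        # Stationarity of the density gives x = sum(w * mu) / sum(w).
        x = float(np.sum(w * mu) / np.sum(w))
    return x


# Two overlapping components approximating a unimodal distribution.
pi = np.array([0.5, 0.5])
mu = np.array([-0.5, 0.5])
sigma = np.array([1.0, 1.0])

full = mixture_variance(pi, mu, sigma)            # 1.0 within + 0.25 between
scaled = mixture_variance(pi, mu, sigma, 1e-8)    # about 0.25 remains
mode = find_mode(0.4, pi, mu, sigma)              # converges to the mode at 0
```

With these toy values, scaling the variances by $10^{-8}$ removes almost all of the within-component variance, yet samples would still land near $\pm 0.5$ rather than at the single mode at 0, which the fixed-point iteration recovers.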