Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model

Pingyu Wu, Kai Zhu, Yu Liu, Longxiang Tang, Jian Yang, Yansong Peng, Wei Zhai, Yang Cao, Zheng-Jun Zha

TL;DR

This work tackles the misalignment between autoregressive image generation and conventional tokenizers by introducing AliTok, a tokenizer that enforces forward-dependency through a causal decoder while preserving the encoder’s bidirectional semantic richness. A two-stage training regime (a generation-friendly encoder in Stage 1 and a refined bidirectional decoder in Stage 2), along with prefix tokens and an auxiliary loss, yields tokens that are both highly reconstructible and easily modeled by decoder-only AR models. Empirically, decoder-only AR models with AliTok achieve state-of-the-art or competitive gFID scores on ImageNet-256, with a 662M model reaching a gFID of 1.28 and sampling substantially faster than diffusion baselines, highlighting the practical impact of data-centric alignment. The results suggest that carefully designed tokenizers can unlock the full potential of simple autoregressive generation for high-fidelity, efficient multimodal synthesis, with broad implications for future multimodal unification.

Abstract

Autoregressive image generation aims to predict the next token based on previous ones. However, this process is challenged by the bidirectional dependencies inherent in conventional image tokenizations, which create a fundamental misalignment with the unidirectional nature of autoregressive models. To resolve this, we introduce AliTok, a novel Aligned Tokenizer that alters the dependency structure of the token sequence. AliTok employs a bidirectional encoder constrained by a causal decoder, a design that compels the encoder to produce a token sequence with both semantic richness and forward-dependency. Furthermore, by incorporating prefix tokens and employing a two-stage tokenizer training process to enhance reconstruction performance, AliTok achieves high fidelity and predictability simultaneously. Building upon AliTok, a standard decoder-only autoregressive model with just 177M parameters achieves a gFID of 1.44 and an IS of 319.5 on the ImageNet-256 benchmark. Scaling up to 662M parameters, our model reaches a gFID of 1.28, surpassing the state-of-the-art diffusion method while achieving a 10x faster sampling speed. The code and weights are available at https://github.com/ali-vilab/alitok.
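To make the bidirectional-versus-causal distinction concrete, the following is a minimal toy sketch, not the AliTok implementation. It assumes a single attention head with identity projections, and the function names (`attention`, `changed_positions`) are hypothetical. Prefix tokens are not modeled here. It shows that, under a causal mask, editing a later token leaves the outputs at earlier positions untouched, whereas bidirectional attention lets the edit leak into every position.

```python
import math

def attention(x, causal):
    """Toy single-head self-attention with identity projections (an assumption).

    x: list of token vectors; causal: if True, position i sees only j <= i.
    """
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        visible = range(i + 1) if causal else range(n)
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
                  for j in visible]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * x[j][k] for w, j in zip(weights, visible))
                    for k in range(d)])
    return out

def changed_positions(a, b, tol=1e-12):
    """Indices whose output vectors differ between two runs."""
    return [i for i, (u, v) in enumerate(zip(a, b))
            if any(abs(p - q) > tol for p, q in zip(u, v))]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
edited = [row[:] for row in tokens]
edited[3] = [-2.0, 3.0]  # perturb only the last token

causal_changed = changed_positions(attention(tokens, True), attention(edited, True))
bidir_changed = changed_positions(attention(tokens, False), attention(edited, False))
print("causal:", causal_changed)
print("bidirectional:", bidir_changed)
```

With the causal mask only the last position changes. Without it every position changes. That leakage is the backward dependency that a next-token predictor cannot exploit.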

Paper Structure

This paper contains 17 sections, 5 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: 256$\times$256 samples of class-conditional generation on ImageNet using our AliTok-XL model (662M).
  • Figure 2: Reconstruction vs. generation with different transformer-based tokenizers. Images are compressed into raster-scan order 1D sequences by the tokenizers. AR denotes standard decoder-only autoregressive models. Green marks poor results and red marks good results. The best results are bolded. Tok.: Tokenizer. Acc: Training accuracy. Fair setup with matched parameter counts and computational loads. See Appendix \ref{sec:details_supp} for details.
  • Figure 3: Sampling time and gFID (w/o cfg and w/ cfg). Sampling time is evaluated on an A800.
  • Figure 4: Two-stage training process of the proposed AliTok. Stage 1: Training an image tokenizer with a causal decoder. Stage 2: Freezing the encoder and codebook of the tokenizer, training the autoregressive model and retraining a bidirectional tokenizer decoder.
  • Figure 5: 256$\times$256 samples generated by our models of different sizes.
  • ...and 10 more figures
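
The two-stage schedule described in the Figure 4 caption can be summarized as a small lookup of which components receive gradient updates in each stage. This is only a reading of the caption. The component names (`encoder`, `codebook`, `causal_decoder`, `ar_model`, `bidirectional_decoder`) and the helper `is_trainable` are hypothetical and are not taken from the released code.

```python
# Hypothetical component names; the schedule follows the Figure 4 caption only.
STAGES = {
    # Stage 1: train the tokenizer (encoder, codebook) with a causal decoder.
    1: {"trainable": {"encoder", "codebook", "causal_decoder"},
        "frozen": set()},
    # Stage 2: freeze encoder and codebook; train the AR model and
    # retrain a bidirectional tokenizer decoder.
    2: {"trainable": {"ar_model", "bidirectional_decoder"},
        "frozen": {"encoder", "codebook"}},
}

def is_trainable(component: str, stage: int) -> bool:
    """Return True if `component` is updated during `stage`."""
    return component in STAGES[stage]["trainable"]

for stage, groups in STAGES.items():
    print(f"Stage {stage}: train {sorted(groups['trainable'])}, "
          f"freeze {sorted(groups['frozen'])}")
```

Keeping the encoder and codebook frozen in Stage 2 means the token sequence seen by the autoregressive model does not change while the decoder is retrained.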