Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model
Pingyu Wu, Kai Zhu, Yu Liu, Longxiang Tang, Jian Yang, Yansong Peng, Wei Zhai, Yang Cao, Zheng-Jun Zha
TL;DR
This work tackles the misalignment between autoregressive image generation and conventional tokenizers by introducing AliTok, a tokenizer that enforces forward-dependency through a causal decoder while preserving the encoder's bidirectional semantic richness. A two-stage training regime (a generation-friendly encoder in Stage 1 and a refined bidirectional decoder in Stage 2), together with prefix tokens and an auxiliary loss, yields tokens that are both highly reconstructible and easily modeled by decoder-only AR models. Empirically, decoder-only AR models with AliTok achieve state-of-the-art or competitive gFID scores on ImageNet-256: a 662M model reaches a gFID of 1.28 while sampling about 10x faster than the state-of-the-art diffusion method, which highlights the practical impact of aligning the tokenizer with the generator. The results suggest that a carefully designed tokenizer can unlock the full potential of simple autoregressive generation for high-fidelity, efficient synthesis, with possible implications for future multimodal unification.
Abstract
Autoregressive image generation aims to predict the next token based on previous ones. However, this process is challenged by the bidirectional dependencies inherent in conventional image tokenizations, which create a fundamental misalignment with the unidirectional nature of autoregressive models. To resolve this, we introduce AliTok, a novel Aligned Tokenizer that alters the dependency structure of the token sequence. AliTok employs a bidirectional encoder constrained by a causal decoder, a design that compels the encoder to produce a token sequence with both semantic richness and forward-dependency. Furthermore, by incorporating prefix tokens and employing a two-stage tokenizer training process to enhance reconstruction performance, AliTok achieves high fidelity and predictability simultaneously. Building upon AliTok, a standard decoder-only autoregressive model with just 177M parameters achieves a gFID of 1.44 and an IS of 319.5 on the ImageNet-256 benchmark. Scaling up to 662M parameters, our model reaches a gFID of 1.28, surpassing the state-of-the-art diffusion method while achieving a 10x faster sampling speed. The code and weights are available at https://github.com/ali-vilab/alitok.
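
To make the contrast between bidirectional and forward-only dependency concrete, the following is a minimal NumPy sketch of single-head self-attention under the two masking schemes. It is an illustration only and not the AliTok implementation: the function names, the absence of learned projections, and the toy sequence of 6 tokens with 4 dimensions are all assumptions made for this example. It shows the one property the abstract relies on, namely that under a causal mask the output at position i depends only on tokens 0..i, whereas under a bidirectional mask it depends on the whole sequence.

```python
import numpy as np


def bidirectional_mask(n: int) -> np.ndarray:
    """Every position may attend to every other position."""
    return np.ones((n, n), dtype=bool)


def causal_mask(n: int) -> np.ndarray:
    """Position i may attend only to positions 0..i (lower-triangular)."""
    return np.tril(np.ones((n, n), dtype=bool))


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention; masked-out pairs get zero weight."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability; the diagonal is
    # always unmasked, so each row has at least one finite entry.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(6, 4))   # toy sequence: 6 tokens, 4 dims (assumed sizes)
    y = x.copy()
    y[-1] += 1.0                  # perturb only the final token

    for name, mask in [("causal", causal_mask(6)),
                       ("bidirectional", bidirectional_mask(6))]:
        out_x = masked_attention(x, x, x, mask)
        out_y = masked_attention(y, y, y, mask)
        # Compare outputs at all positions before the perturbed token.
        unchanged = np.allclose(out_x[:-1], out_y[:-1])
        print(f"{name}: earlier positions unaffected by a later token = {unchanged}")
```

In this sketch, perturbing the last token leaves the earlier outputs untouched only in the causal case. This is the kind of forward-only dependency that a decoder-only autoregressive model can exploit, since it never sees later tokens when predicting the next one.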
