Autoregressive Image Generation with Masked Bit Modeling
Qihang Yu, Qihao Liu, Ju He, Xinyang Zhang, Yang Liu, Liang-Chieh Chen, Xi Chen
TL;DR
The paper reframes visual generation by showing that the performance gap between discrete and continuous tokenizers is driven by the latent bit budget $B$, not an inherent limitation of discrete representations. It introduces BAR, a scalable autoregressive framework with a Masked Bit Modeling head that generates token bits progressively, enabling arbitrarily large codebooks with $O(\log_2 C)$ memory per step. Empirically, BAR closes or even reverses the discrete–continuous gap, achieving a gFID of $0.99$ on ImageNet-256 and surpassing diffusion-based methods while offering faster sampling and training efficiency. This approach demonstrates that high-quality image generation can be attained with discrete tokenization and provides a practical path toward more accessible, efficient generative models, while highlighting the need for responsible deployment.
Abstract
This paper challenges the dominance of continuous pipelines in visual generation. We systematically investigate the performance gap between discrete and continuous methods. Contrary to the belief that discrete tokenizers are intrinsically inferior, we demonstrate that the disparity arises primarily from the total number of bits allocated in the latent space (i.e., the compression ratio). We show that scaling up the codebook size effectively bridges this gap, allowing discrete tokenizers to match or surpass their continuous counterparts. However, existing discrete generation methods struggle to capitalize on this insight: they suffer from performance degradation or prohibitive training costs as the codebook scales. To address this, we propose masked Bit AutoRegressive modeling (BAR), a scalable framework that supports arbitrary codebook sizes. By equipping an autoregressive transformer with a masked bit modeling head, BAR predicts discrete tokens by progressively generating their constituent bits. BAR achieves a new state-of-the-art gFID of 0.99 on ImageNet-256, outperforming leading methods across both continuous and discrete paradigms, while significantly reducing sampling costs and converging faster than prior continuous approaches. The project page is available at https://bar-gen.github.io/
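To make the bit-level view concrete, the following minimal Python sketch shows how a codebook index can be written as roughly $\log_2 C$ bits and how the latent bit budget of a discrete tokenizer follows from the token count and codebook size. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 256-token grid, the 65,536-entry codebook, and the most-significant-bit-first ordering are all hypothetical choices made for this example.

```python
def bits_per_token(codebook_size: int) -> int:
    """Number of bits needed to index a codebook with `codebook_size` entries."""
    return (codebook_size - 1).bit_length()


def bit_budget(num_tokens: int, codebook_size: int) -> int:
    """Total latent bits of a discrete tokenizer: tokens times bits per token."""
    return num_tokens * bits_per_token(codebook_size)


def index_to_bits(index: int, num_bits: int) -> list[int]:
    """Binary expansion of a codebook index, most significant bit first (assumed order)."""
    return [(index >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]


def bits_to_index(bits: list[int]) -> int:
    """Inverse of index_to_bits: reassemble the codebook index from its bits."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index


# Hypothetical setting: 256 tokens per image, codebook of 2**16 entries.
num_bits = bits_per_token(2**16)
total_bits = bit_budget(256, 2**16)
token_bits = index_to_bits(40000, num_bits)
recovered = bits_to_index(token_bits)
```

Under this representation, a head that predicts bits needs only `num_bits` binary outputs per token instead of one logit per codebook entry, which is the intuition behind the logarithmic dependence on $C$. How BAR masks and progressively reveals those bits is not specified in this section and is not modeled here.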
