Autoregressive Image Generation with Masked Bit Modeling

Qihang Yu, Qihao Liu, Ju He, Xinyang Zhang, Yang Liu, Liang-Chieh Chen, Xi Chen

TL;DR

The paper reframes visual generation by showing that the performance gap between discrete and continuous tokenizers is driven by the latent bit budget $B$, not an inherent limitation of discrete representations. It introduces BAR, a scalable autoregressive framework with a Masked Bit Modeling head that generates token bits progressively, enabling arbitrarily large codebooks with $O(\log_2 C)$ memory per step. Empirically, BAR closes or even reverses the discrete–continuous gap, achieving a gFID of $0.99$ on ImageNet-256 and surpassing diffusion-based methods while offering faster sampling and training efficiency. This approach demonstrates that high-quality image generation can be attained with discrete tokenization and provides a practical path toward more accessible, efficient generative models, while highlighting the need for responsible deployment.

Abstract

This paper challenges the dominance of continuous pipelines in visual generation. We systematically investigate the performance gap between discrete and continuous methods. Contrary to the belief that discrete tokenizers are intrinsically inferior, we demonstrate that the disparity arises primarily from the total number of bits allocated in the latent space (i.e., the compression ratio). We show that scaling up the codebook size effectively bridges this gap, allowing discrete tokenizers to match or surpass their continuous counterparts. However, existing discrete generation methods struggle to capitalize on this insight, suffering from performance degradation or prohibitive training costs with scaled codebooks. To address this, we propose masked Bit AutoRegressive modeling (BAR), a scalable framework that supports arbitrary codebook sizes. By equipping an autoregressive transformer with a masked bit modeling head, BAR predicts discrete tokens by progressively generating their constituent bits. BAR achieves a new state-of-the-art gFID of 0.99 on ImageNet-256, outperforming leading methods across both continuous and discrete paradigms, while significantly reducing sampling costs and converging faster than prior continuous approaches. The project page is available at https://bar-gen.github.io/
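
The bit-budget view in the abstract can be made concrete with a short calculation. The sketch below assumes that the budget of a discrete latent is the number of tokens times the bits per token, with $\log_2 C$ bits per token for a codebook of size $C$. The function name `latent_bits`, the 16x16 token grid, and the two codebook sizes are illustrative assumptions, not settings taken from the paper.

```python
import math


def latent_bits(num_tokens: int, codebook_size: int) -> float:
    """Total bits in a discrete latent: tokens x log2(codebook size)."""
    return num_tokens * math.log2(codebook_size)


# Hypothetical settings: the same 16x16 token grid with two codebook sizes.
small = latent_bits(16 * 16, 2 ** 14)  # 256 tokens x 14 bits each
large = latent_bits(16 * 16, 2 ** 32)  # 256 tokens x 32 bits each

print(small, large)
```

Under these assumed settings, enlarging the codebook raises the bit budget without changing the number of tokens, which is the lever the abstract describes.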

Paper Structure (16 sections, 7 equations, 18 figures, 8 tables)

Figures (18)

  • Figure 1: The proposed BAR achieves a superior quality-cost trade-off (generation FID vs. throughput) on ImageNet-256.
  • Figure 2: Best discrete and continuous generator comparison.
  • Figure 3: A unified view for comparing discrete and continuous tokenizers. By measuring information capacity in bits, we enable a direct comparison. The continuous tokenizer MAR-VAE (Li et al., 2024) outperforms the discrete tokenizer LlamaGen-VQ (Sun et al., 2024) in reconstruction quality, a result directly attributable to its substantially higher bit allocation.
  • Figure 4: Scaling BAR's discrete tokenizer (BAR-FSQ) with Bit Budget. Standard discrete methods (green circles) historically lag behind continuous baselines (blue circles) primarily due to restricted bit allocation. By systematically scaling the codebook size, BAR-FSQ (red curve) demonstrates that a discrete tokenizer's reconstruction performance is not inherently bounded; it matches and then surpasses continuous reconstruction fidelity as the bit budget increases, challenging the assumption that continuous latent spaces are required for high-fidelity reconstruction.
  • Figure 5: Overview of the proposed BAR framework. We decompose autoregressive visual generation into two stages: context modeling and token prediction. (a) For context modeling, we employ an autoregressive transformer to generate latent conditions via causal attention. For the subsequent token prediction stage, we contrast our method with two baselines: (b) A standard linear head predicts logits over the full codebook. While effective for small vocabularies ($<2^{18}$), it fails to scale to larger sizes due to computational bottlenecks. (c) A bit-based head predicts bits directly; while scalable, it results in inferior generation quality. (d) The proposed Masked Bit Modeling (MBM) head generates bits via a progressive unmasking mechanism conditioned on the autoregressive transformer's output. Unlike the baselines, MBM achieves both exceptional scalability and superior generation quality.
  • ...and 13 more figures
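
The contrast in Figure 5 between a linear head over the full codebook and a bit-level head can be illustrated with a small sketch. It assumes a token index is represented by its plain binary expansion, and it only compares output sizes per step; it does not implement the progressive unmasking of the MBM head. The helper names `token_to_bits`, `bits_to_token`, and `head_output_sizes` are hypothetical.

```python
def token_to_bits(index: int, num_bits: int) -> list[int]:
    """Binary expansion of a token index, most significant bit first."""
    return [(index >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]


def bits_to_token(bits: list[int]) -> int:
    """Inverse of token_to_bits: fold the bits back into an integer index."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def head_output_sizes(num_bits: int) -> tuple[int, int]:
    """Outputs per step for (linear head over the codebook, bit-level head)."""
    return 2 ** num_bits, num_bits


bits = token_to_bits(5, 4)
linear_size, bit_size = head_output_sizes(18)
print(bits, bits_to_token(bits), linear_size, bit_size)
```

With an 18-bit codebook, the threshold mentioned in the caption, a linear head would emit $2^{18}$ logits per token while a bit-level head emits 18 values, which is the sense in which the cost grows with $\log_2 C$ rather than $C$.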