BitDance: Scaling Autoregressive Generative Models with Binary Tokens

Yuang Ai, Jiaming Han, Shaobin Zhuang, Weijia Mao, Xuefeng Hu, Ziyan Yang, Zhenheng Yang, Huaibo Huang, Xiangyu Yue, Hao Chen

TL;DR

BitDance tackles the bottlenecks of autoregressive (AR) visual generation by adopting a highly expressive binary token space with up to $2^{256}$ states per token, paired with a diffusion-based sampling head that handles this vast discrete space. Next-patch diffusion further accelerates decoding by predicting groups of tokens in parallel while preserving coherence. On ImageNet 256×256, BitDance achieves an FID of 1.24, the best among AR models; with next-patch diffusion, a compact 260M-parameter model outperforms 1.4B-parameter parallel AR baselines while running 8.7× faster. For text-to-image generation, a 14B-parameter variant generates high-resolution images efficiently, with a speedup of over 30× over prior AR models at 1024×1024. These results support scaling token entropy and using diffusion-based sampling as a route to efficient, high-fidelity AR foundation models. Code and models are released as open source.

Abstract

We present BitDance, a scalable autoregressive (AR) image generator that predicts binary visual tokens instead of codebook indices. With high-entropy binary latents, BitDance lets each token represent up to $2^{256}$ states, yielding a compact yet highly expressive discrete representation. Sampling from such a huge token space is difficult with standard classification. To resolve this, BitDance uses a binary diffusion head: instead of predicting an index with softmax, it employs continuous-space diffusion to generate the binary tokens. Furthermore, we propose next-patch diffusion, a new decoding method that predicts multiple tokens in parallel with high accuracy, greatly speeding up inference. On ImageNet 256×256, BitDance achieves an FID of 1.24, the best among AR models. With next-patch diffusion, BitDance beats state-of-the-art parallel AR models that use 1.4B parameters, while using 5.4× fewer parameters (260M) and achieving an 8.7× speedup. For text-to-image generation, BitDance trains on large-scale multimodal tokens and generates high-resolution, photorealistic images efficiently, showing strong performance and favorable scaling. When generating 1024×1024 images, BitDance achieves a speedup of over 30× compared to prior AR models. We release code and models to facilitate further research on AR foundation models. Code and models are available at: https://github.com/shallowdream204/BitDance.
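
As a rough illustration of the token space described in the abstract, the following Python sketch maps a $d$-bit token to a point in continuous space and snaps a perturbed point back to bits. The $\{-1,+1\}$ coordinate convention, the sign-based rounding, and the helper names `bits_to_vertex` and `vertex_to_bits` are assumptions made for illustration. The diffusion head itself is not reproduced here.

```python
import random


def bits_to_vertex(bits):
    """Map a d-bit token (values 0/1) to a point with coordinates in {-1.0, +1.0}."""
    return [2.0 * b - 1.0 for b in bits]


def vertex_to_bits(x):
    """Snap a continuous d-dimensional point to bits using the sign of each coordinate."""
    return [1 if v >= 0.0 else 0 for v in x]


d = 256                # bits per token, matching the 2^256 figure in the abstract
num_states = 2 ** d    # number of distinct values a single d-bit token can take

rng = random.Random(0)
bits = [rng.randint(0, 1) for _ in range(d)]
vertex = bits_to_vertex(bits)

# Stand-in for an imperfect continuous sample: perturb each coordinate by at most 0.5,
# which is too small to flip any sign.
noisy = [v + rng.uniform(-0.5, 0.5) for v in vertex]
recovered = vertex_to_bits(noisy)

print(len(str(num_states)), "decimal digits in 2^256")
print("bits recovered exactly:", recovered == bits)
```

A softmax over all values of such a token would need $2^{256}$ output classes, which is why the abstract calls standard classification difficult at this scale.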

Paper Structure

This paper contains 19 sections, 7 equations, 9 figures, 14 tables.

Figures (9)

  • Figure 1: Performance vs. efficiency compared with SOTA diffusion models and autoregressive models.
  • Figure 2: High-resolution samples generated by the 14B BitDance model, showcasing its capabilities in prompt adherence, spatial reasoning, and text rendering across various aspect ratios and artistic styles.
  • Figure 3: Comparison of binary token sampling paradigms. Scaling up binary token entropy yields reconstruction performance on par with continuous VAEs, but it also creates a sampling bottleneck. For a $d$-channel binary token: (a) Directly modeling $p(b_1,b_2,\dots,b_d)$ requires $h\times 2^d$ parameters, which grows exponentially with $d$. (b) Bit-wise classification (Han et al., 2025) reduces the parameter count to $h\times 2d$ by assuming bit independence, i.e., $\prod_{i=1}^{d} p(b_i)$, but this restrictive assumption compromises sampling fidelity. (c) We embed binary tokens as vertices of a $d$-dimensional hypercube in continuous space. Modeling the joint distribution of all bits with a diffusion objective keeps the parameter count manageable as $d$ scales up while preserving high-fidelity sampling.
  • Figure 4: Architecture of BitDance, an autoregressive model trained on multi-modal tokens. An input image is first encoded into binary latents and then flattened into a 1D sequence following a patch-wise raster scan order with patch size $p\times p$. Vision tokens are modeled using our proposed next-patch diffusion, utilizing a binary diffusion head to achieve efficient and precise parallel prediction.
  • Figure 5: Comparison of different sampling heads for parallel prediction in autoregressive models. (a) The standard classification head is limited to independent token sampling, which violates the inherent dependencies required for parallel prediction. (b) Our proposed binary diffusion head models the joint distribution of tokens generated simultaneously, enabling coherent sampling.
  • ...and 4 more figures
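
The parameter counts quoted in the Figure 3 caption and the patch-wise raster scan mentioned in the Figure 4 caption can be made concrete with a short Python sketch. The hidden width `h = 1024`, the 4×4 token grid, and all function names are hypothetical values chosen for illustration. The raster order inside each patch is also an assumption, since the caption specifies only that patches follow a raster scan.

```python
def head_params_joint(h, d):
    """Output-layer size for a softmax over all 2^d token values: h * 2^d."""
    return h * 2 ** d


def head_params_bitwise(h, d):
    """Output-layer size for d independent two-way classifiers: h * 2d."""
    return h * 2 * d


def patch_raster_order(height, width, p):
    """Flatten a height x width token grid patch by patch.

    Patches of size p x p are visited in raster order; tokens inside a patch
    are also visited in raster order (an assumption, see the lead-in).
    """
    if height % p != 0 or width % p != 0:
        raise ValueError("grid dimensions must be divisible by the patch size")
    order = []
    for patch_row in range(0, height, p):
        for patch_col in range(0, width, p):
            for r in range(patch_row, patch_row + p):
                for c in range(patch_col, patch_col + p):
                    order.append((r, c))
    return order


h = 1024  # hypothetical hidden width
for d in (8, 16, 32):
    print(d, head_params_joint(h, d), head_params_bitwise(h, d))

order = patch_raster_order(4, 4, 2)
print(order[:8])
```

With this ordering, the $p \times p$ tokens of one patch are contiguous in the 1D sequence, so a decoder that emits one patch per step produces $p^2$ tokens at a time.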