D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens
Panpan Wang, Liqiang Niu, Fandong Meng, Jinan Xu, Yufeng Chen, Jie Zhou
TL;DR
The paper addresses the gap between discrete-token autoregressive models and diffusion-based continuous-token models by proposing D2C, a two-stage hybrid autoregressive framework. In stage one, a small discrete-valued autoregressive model generates coarse tokens from a class-conditioned prefix; in stage two, an MAE-inspired hybrid autoregressive model produces continuous-valued tokens conditioned on the discrete sequence, with two fusion modules enabling interaction between the two token types. Empirical results on ImageNet-256 show that D2C, especially with a Q-Former fusion module, achieves state-of-the-art or competitive FID scores with fewer inference steps and faster inference than relevant baselines, including MAR, while larger variants improve fidelity further. The work demonstrates that coupling discrete and continuous tokens via a structured two-stage process yields higher-quality and more efficient image generation, offering a promising direction for hybrid-token synthesis in visual generative modeling.
Abstract
In the domain of image generation, latent-based generative models occupy a dominant position; however, these models rely heavily on the image tokenizer. To meet modeling requirements, autoregressive models, which are scalable and flexible, adopt a discrete-valued tokenizer but suffer from poor image generation quality. In contrast, diffusion models take advantage of a continuous-valued tokenizer to achieve better generation quality but suffer from low efficiency and high complexity. Existing hybrid models mainly aim to compensate for information loss and to simplify the diffusion learning process. The potential of merging discrete-valued and continuous-valued tokens in the field of image generation has not yet been explored. In this paper, we propose D2C, a novel two-stage method to enhance model generation capacity. In the first stage, discrete-valued tokens representing coarse-grained image features are sampled using a small discrete-valued generator. In the second stage, continuous-valued tokens representing fine-grained image features are learned conditioned on the discrete token sequence. In addition, we design two kinds of fusion modules for seamless interaction between the two token types. On the ImageNet-256 benchmark, extensive experimental results validate that our model achieves superior performance compared with several continuous-valued and discrete-valued generative models on the class-conditional image generation task.
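The two-stage flow described above can be illustrated with a toy sketch. It is not the authors' implementation: the function names, the hand-written next-token weights, the Gaussian stand-in for the continuous-token head, and the random-order filling schedule are all assumptions made for illustration, and no fusion module is modeled. It only shows the data flow, in which a class label seeds coarse discrete tokens and those tokens then condition the continuous ones.

```python
import random


def sample_discrete_tokens(class_id, num_tokens, vocab_size, rng):
    """Stage 1 (toy): sample coarse discrete tokens one at a time,
    starting from a class-conditioned prefix."""
    tokens = []
    prev = class_id  # the class label acts as the prefix
    for _ in range(num_tokens):
        # Hypothetical stand-in for a learned next-token distribution:
        # the weights depend only on the previous token.
        weights = [1.0 + ((prev + 3 * v) % 7) for v in range(vocab_size)]
        prev = rng.choices(range(vocab_size), weights=weights, k=1)[0]
        tokens.append(prev)
    return tokens


def generate_continuous_tokens(discrete_tokens, dim, steps, rng):
    """Stage 2 (toy): fill in continuous tokens over a few steps,
    each one conditioned on the discrete token at the same position."""
    n = len(discrete_tokens)
    continuous = [None] * n
    order = list(range(n))
    rng.shuffle(order)  # assumed random filling order
    per_step = -(-n // steps)  # ceiling division: positions filled per step
    for step in range(steps):
        for pos in order[step * per_step:(step + 1) * per_step]:
            # Hypothetical stand-in for a learned continuous head:
            # a Gaussian whose mean is set by the discrete token.
            mean = discrete_tokens[pos] / 10.0
            continuous[pos] = [rng.gauss(mean, 0.1) for _ in range(dim)]
    return continuous


def d2c_generate(class_id, num_tokens=16, vocab_size=8, dim=4, steps=4, seed=0):
    """Run both toy stages and return (coarse, fine) token sequences."""
    rng = random.Random(seed)
    coarse = sample_discrete_tokens(class_id, num_tokens, vocab_size, rng)
    fine = generate_continuous_tokens(coarse, dim, steps, rng)
    return coarse, fine


if __name__ == "__main__":
    coarse, fine = d2c_generate(class_id=3)
    print(len(coarse), "discrete tokens;", len(fine), "continuous tokens")
```

In this sketch the second stage needs only `steps` passes rather than one pass per token, which loosely mirrors the reduced number of inference steps mentioned in the TL;DR; in the actual model, learned networks and a fusion module would replace both placeholder distributions.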
