RDPM: Solve Diffusion Probabilistic Models via Recurrent Token Prediction
Xiaoping Wu, Jie Hu, Xiaoming Wei
TL;DR
RDPM presents a novel discrete-diffusion framework that denoises in a discrete token space by repeatedly predicting discrete codes across timesteps. It couples diffusion-based image tokenization in the VQ-VAE latent space with a recurrent transformer that performs next-token prediction under a cross-entropy objective, aligning diffusion with GPT-style training. The approach delivers high-quality image generation with only 10 denoising steps and a parameter count substantially smaller than that of many autoregressive models, while enabling diversity through stochastic guidance and Gumbel noise. By formulating diffusion as multi-step discrete token prediction, RDPM advances a unified paradigm for multimodal generation and lays groundwork for integrating continuous signals with text. The authors plan to release the code and model weights as open source.
Abstract
Diffusion Probabilistic Models (DPMs) have emerged as the de facto approach for high-fidelity image synthesis. They run the diffusion process on continuous VAE latents, which differs significantly from the text generation methods employed by Large Language Models (LLMs). In this paper, we introduce a novel generative framework, the Recurrent Diffusion Probabilistic Model (RDPM), which enhances the diffusion process through a recurrent token prediction mechanism, thereby pioneering the field of Discrete Diffusion. By progressively introducing Gaussian noise into the latent representations of images and encoding them into vector-quantized tokens in a recurrent manner, RDPM enables a diffusion process over a discrete-valued domain. This process iteratively predicts the token codes for subsequent timesteps, transforming the initial standard Gaussian noise into the source data distribution, and uses the same loss function as GPT-style models. RDPM demonstrates superior performance while benefiting from the speed advantage of requiring only a few inference steps. The model not only leverages the diffusion process to ensure high-quality generation but also converts continuous signals into a series of high-fidelity discrete tokens, thereby sharing a unified optimization strategy with other discrete tokens, such as text. We anticipate that this work will contribute to the development of a unified model for multimodal generation, specifically by integrating continuous signal domains such as images, videos, and audio with text. We will release the code and model weights to the open-source community.
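To make the abstract's description concrete, here is a minimal NumPy sketch of how per-timestep token targets might be built and scored. It is an illustration under stated assumptions, not the authors' implementation: the linear noise schedule, the nearest-neighbour quantizer, the function names, and the choice to quantize each timestep independently (rather than recurrently, as the paper describes) are all hypothetical simplifications.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector (N, D) to the index of its nearest code (K, D)."""
    # Squared Euclidean distance between every latent and every codebook entry.
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def noisy_token_sequence(latents, codebook, num_steps=10, seed=0):
    """Return token ids of shape (num_steps + 1, N).

    Row 0 is tokenized from pure Gaussian noise, the last row from the
    clean latents. The linear schedule below is an assumption.
    """
    rng = np.random.default_rng(seed)
    alphas = np.linspace(0.0, 1.0, num_steps + 1)  # signal weight, 0 -> 1
    tokens = []
    for a in alphas:
        noise = rng.standard_normal(latents.shape)
        noisy = np.sqrt(a) * latents + np.sqrt(1.0 - a) * noise
        tokens.append(quantize(noisy, codebook))
    return np.stack(tokens)

def cross_entropy(logits, targets):
    """Mean cross-entropy between logits (N, K) and integer targets (N,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy usage: 8 clean latents that sit exactly on entries of a 16-code codebook.
rng = np.random.default_rng(1)
codebook = rng.standard_normal((16, 4))
latents = codebook[rng.integers(0, 16, size=8)]
tokens = noisy_token_sequence(latents, codebook)

# A model would predict tokens[t + 1] from the tokens up to step t;
# zero logits stand in here for an untrained predictor.
loss = cross_entropy(np.zeros((8, 16)), tokens[1])
```

In this sketch the target at each step is simply the token map at the next, less noisy step, so training reduces to the same cross-entropy objective used for next-token prediction in text models.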
