RDPM: Solve Diffusion Probabilistic Models via Recurrent Token Prediction

Xiaoping Wu, Jie Hu, Xiaoming Wei

TL;DR

RDPM is a discrete-diffusion framework that denoises in a discrete token space by repeatedly predicting discrete codes across timesteps. It couples diffusion-based image tokenization in the VQ-VAE latent space with a recurrent transformer that performs next-token prediction under a cross-entropy objective, aligning diffusion with GPT-style training. The approach delivers high-quality image generation with only 10 denoising steps and a parameter count substantially smaller than that of many autoregressive models, while enabling diversity through stochastic guidance and Gumbel noise. By formulating diffusion as multi-step discrete token prediction, RDPM moves toward a unified paradigm for multimodal generation and lays groundwork for integrating continuous signals with text. The authors state that they will release the code and model weights.

Abstract

Diffusion Probabilistic Models (DPMs) have emerged as the de facto approach for high-fidelity image synthesis, operating diffusion processes on continuous VAE latents, an approach that differs significantly from the text generation methods employed by Large Language Models (LLMs). In this paper, we introduce a novel generative framework, the Recurrent Diffusion Probabilistic Model (RDPM), which enhances the diffusion process through a recurrent token prediction mechanism, thereby pioneering the field of Discrete Diffusion. By progressively introducing Gaussian noise into the latent representations of images and encoding them into vector-quantized tokens in a recurrent manner, RDPM facilitates a unique diffusion process on discrete-valued domains. This process iteratively predicts the token codes for subsequent timesteps, transforming the initial standard Gaussian noise into the source data distribution, and aligns with GPT-style models in terms of the loss function. RDPM demonstrates superior performance while benefiting from the speed advantage of requiring only a few inference steps. The model not only leverages the diffusion process to ensure high-quality generation but also converts continuous signals into a series of high-fidelity discrete tokens, thereby maintaining a unified optimization strategy with other discrete tokens, such as text. We anticipate that this work will contribute to the development of a unified model for multimodal generation, specifically by integrating continuous signal domains such as images, videos, and audio with text. We will release the code and model weights to the open-source community.
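
The tokenization idea in the abstract (mix Gaussian noise into a latent, then quantize) can be illustrated with a minimal sketch. It is not the authors' implementation. The mixing rule $\sqrt{\alpha_t}\,x + \sqrt{1-\alpha_t}\,\epsilon$, the plain nearest-neighbor quantizer, and the names `nearest_code` and `diffusion_tokenize` are assumptions made for illustration; the paper's recurrent quantizer and noise schedule may differ.

```python
import math
import random

def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )

def diffusion_tokenize(latent, codebook, alphas, rng):
    """Quantize a progressively noised latent into one code list per timestep.

    latent: list of token vectors (the clean latent).
    alphas: signal-retention coefficients in [0, 1], one per timestep.
    Assumption: variance-preserving mixing; the paper's rule may differ.
    """
    codes_per_step = []
    for alpha in alphas:
        noisy = [
            [math.sqrt(alpha) * v + math.sqrt(1.0 - alpha) * rng.gauss(0.0, 1.0)
             for v in vec]
            for vec in latent
        ]
        codes_per_step.append([nearest_code(vec, codebook) for vec in noisy])
    return codes_per_step

rng = random.Random(0)
codebook = [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]  # toy codebook
latent = [[0.9, 1.1], [-1.0, 0.8], [1.2, -0.9]]                  # toy latent
alphas = [0.0, 0.5, 1.0]  # ordered from pure noise to the clean latent
codes = diffusion_tokenize(latent, codebook, alphas, rng)
```

With `alphas` ordered from pure noise to the clean latent, each entry of `codes` can serve as the prediction target for the timestep after it, and the last entry is the quantization of the clean latent.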

Paper Structure

This paper contains 15 sections, 8 equations, 5 figures, 5 tables, 2 algorithms.

Figures (5)

  • Figure 1: Taxonomy of modern image generation. From the continuous/discrete tokenizer and AR/Diffusion perspective, we categorize modern generation approaches into four types, including continuous diffusion, discrete diffusion, continuous AR, and discrete AR.
  • Figure 2: Visualizations of the ImageNet class-conditional $256 \times 256$ images generated by RDPM.
  • Figure 3: Comparison of various visual generation patterns. The images in rows 2 and 3, segmented into blocks, represent discrete quantized tokens, whereas the images in row 1 are depicted with raw pixels. The gray grids denote masked tokens.
  • Figure 4: Overview of the RDPM framework. The framework comprises two main stages: 1) Diffusion-based image tokenization, where Gaussian noise is incrementally mixed into the image and quantized to obtain discrete codes, and 2) Recurrent token prediction using a transformer model, which efficiently synthesizes images by recurrently predicting discrete visual codes at each timestep.
  • Figure 5: (Left) Variation of $\alpha_t$ across different scheduling strategies. (Right) Changes in FID as the hyperparameter $\varphi$ varies under the $pow$ schedule. The experiment is conducted on a model with a depth of $12$ trained for $100$ epochs.
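
The second stage in Figure 4, recurrent prediction of discrete codes, is trained with a cross-entropy objective and, per the TL;DR, uses Gumbel noise for diversity. The sketch below shows both pieces on plain Python lists. The function names and the `scale` parameter are hypothetical, and how RDPM actually injects Gumbel noise is not specified here, so treat this as an illustration of the standard Gumbel-max trick rather than the paper's sampler.

```python
import math
import random

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def gumbel_argmax(logits, rng, scale=1.0):
    """Pick a code index as argmax(logits + scale * Gumbel(0, 1) noise).

    scale=1.0 draws an exact sample from softmax(logits);
    scale=0.0 reduces to a deterministic argmax.
    """
    def gumbel():
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid log(0)
        return -math.log(-math.log(u))

    noisy = [l + scale * gumbel() for l in logits]
    return max(range(len(noisy)), key=noisy.__getitem__)

# Toy example: one token position, 4 codes, 3 timesteps (hypothetical numbers).
logits_per_step = [[0.0, 0.0, 0.0, 0.0], [0.1, 2.0, -1.0, 0.3], [0.0, 4.0, 0.0, 0.0]]
targets = [2, 1, 1]
loss = sum(cross_entropy(l, t) for l, t in zip(logits_per_step, targets))

rng = random.Random(0)
sampled = [gumbel_argmax(l, rng) for l in logits_per_step]
greedy = [gumbel_argmax(l, rng, scale=0.0) for l in logits_per_step[1:]]
```

Summing the per-timestep cross-entropy mirrors how a GPT-style loss is accumulated over sequence positions, which is the alignment with language-model training that the abstract describes.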