Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation

Yao Teng, Fuyun Wang, Xian Liu, Zhekai Chen, Han Shi, Yu Wang, Zhenguo Li, Weiyang Liu, Difan Zou, Xihui Liu

TL;DR

This work tackles the latency of autoregressive text-to-image generation caused by token-by-token decoding. It introduces Speculative Jacobi-Denoising Decoding (SJD2), which embeds a diffusion-inspired denoising trajectory into Jacobi iterations. The model is fine-tuned on noise-perturbed token embeddings to predict the next clean token, which enables parallel token refinement. At inference, the method runs a parallel forward pass over a fixed-length Jacobi window, accepts tokens through a probabilistic verification, and refines the unaccepted tokens by denoising, which reduces the number of forward passes while preserving image quality. Empirical results on Lumina-mGPT and Emu-3 show about 4x–5x fewer decoding steps and latency speedups of more than 2x, with competitive FID/CLIP metrics and manageable memory overhead.

Abstract

As a new paradigm of visual content generation, autoregressive text-to-image models suffer from slow inference due to their sequential token-by-token decoding process, often requiring thousands of model forward passes to generate a single image. To address this inefficiency, we propose Speculative Jacobi-Denoising Decoding (SJD2), a framework that incorporates the denoising process into Jacobi iterations to enable parallel token generation in autoregressive models. Our method introduces a next-clean-token prediction paradigm that enables the pre-trained autoregressive models to accept noise-perturbed token embeddings and predict the next clean tokens through low-cost fine-tuning. This denoising paradigm guides the model towards more stable Jacobi trajectories. During inference, our method initializes token sequences with Gaussian noise and performs iterative next-clean-token-prediction in the embedding space. We employ a probabilistic criterion to verify and accept multiple tokens in parallel, and refine the unaccepted tokens for the next iteration with the denoising trajectory. Experiments show that our method can accelerate generation by reducing model forward passes while maintaining the visual quality of generated images.
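
As a rough illustration of the verification step described in the abstract, the sketch below accepts a prefix of drafted tokens. The abstract only states that a probabilistic criterion is used. The specific rule here (accept with probability min(1, p/q), as in standard speculative sampling) and the names `accept_prefix`, `p_draft`, and `p_target` are assumptions for illustration, not the paper's confirmed implementation. Resampling of the first rejected token and the denoising refinement are omitted.

```python
import random


def accept_prefix(p_draft, p_target, rng):
    """Count how many leading drafted tokens are accepted.

    p_draft[i]:  probability of the drafted token at position i under the
                 distribution it was sampled from (assumed > 0).
    p_target[i]: probability of the same token under the distribution
                 produced by the current parallel forward pass.
    Both inputs are hypothetical; the acceptance rule is an assumption
    borrowed from speculative sampling.
    """
    accepted = 0
    for q, p in zip(p_draft, p_target):
        # Accept with probability min(1, p / q); stop at the first rejection
        # so that only a contiguous prefix is accepted.
        if rng.random() < min(1.0, p / q):
            accepted += 1
        else:
            break
    return accepted


if __name__ == "__main__":
    rng = random.Random(0)
    n = accept_prefix([0.5, 0.2, 0.1], [0.5, 0.2, 0.1], rng)
    print("accepted prefix length:", n)
```

In this sketch, tokens after the first rejection would stay in the Jacobi window and be refined in the next iteration.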

Paper Structure

This paper contains 36 sections, 4 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: We propose Speculative Jacobi-Denoising Decoding to accelerate autoregressive text-to-image generation via multi-token prediction. On Lumina-mGPT, the number of model forward passes for inference (denoted as steps) is reduced. The inference step for our decoding is marked in green.
  • Figure 2: Overview of our decoding process. The noisy token embeddings with increasing noise levels undergo a parallel forward pass with a causal attention mask, predicting conditional probabilities and then sampling clean tokens. A probabilistic criterion selects a prefix of tokens for acceptance (green area). For unaccepted tokens, if they are clean, next-token prediction is performed on them (green solid arrows marked with AR). If they are noisy, they are denoised with a one-position offset (blue solid $\hat{x}^{0}$ arrows and dashed arrows).
  • Figure 3: Overview of our training strategy and model process. The input token embedding sequence is randomly divided into segments, and adjacent segments are perturbed with noise of consecutive levels. The noisy embeddings with timestep tokens are fed into transformer blocks and a prediction head. During training, the predicted probability is used to compute the cross-entropy loss for next-token prediction. During inference, the probability is used to sample tokens, which are then converted to embeddings.
  • Figure 4: Effect of the number of denoising iterations and the Jacobi window length on latency. Circle areas represent absolute latency values, and the lowest latency (within a tolerance of 3 seconds) is in orange.
  • Figure 5: Study on embedding normalization for the denoising process. Left: Denoising output without embedding normalization, failing to generate a coherent image. Right: Denoising output with embedding normalization, generating a semantically meaningful image.
  • ...and 4 more figures
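
The segment-wise perturbation described in the Figure 3 caption can be sketched as a per-position noise-level assignment. This is a minimal sketch under stated assumptions: the function name `assign_noise_levels`, the `start_level` parameter, and the choice of uniformly random cut points are hypothetical, and the actual noise schedule that maps a level to a perturbation is not given in this excerpt, so it is left out.

```python
import random


def assign_noise_levels(seq_len, num_segments, start_level, rng):
    """Give each position a noise level so that adjacent segments
    receive consecutive levels (an assumed reading of Figure 3).

    Requires 1 <= num_segments <= seq_len.
    """
    # Pick num_segments - 1 distinct cut points strictly inside the
    # sequence, so every segment contains at least one position.
    cuts = sorted(rng.sample(range(1, seq_len), num_segments - 1))
    bounds = [0] + cuts + [seq_len]
    levels = []
    for seg_idx in range(num_segments):
        seg_len = bounds[seg_idx + 1] - bounds[seg_idx]
        # Each later segment is one noise level above the previous one.
        levels.extend([start_level + seg_idx] * seg_len)
    return levels


if __name__ == "__main__":
    print(assign_noise_levels(10, 3, 0, random.Random(0)))
```

Each level would then index whatever noise schedule is used to perturb the token embeddings in that segment.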