DiSA: Diffusion Step Annealing in Autoregressive Image Generation

Qinyu Zhao, Jaskirat Singh, Ming Xu, Akshay Asthana, Stephen Gould, Liang Zheng

TL;DR

This work addresses the efficiency bottleneck of diffusion-augmented autoregressive image generation by proposing Diffusion Step Annealing (DiSA), a training-free strategy that progressively reduces the number of diffusion steps as more tokens are generated. Grounded in observations that tokens generated later are more constrained and easier to sample, DiSA schedules steps via two-stage, linear, or cosine variants, e.g., from $T_{early}=50$ to $T_{late}=5$ steps, achieving substantial speedups (up to 10×) across MAR, Harmon, FlowAR, and xAR with minimal impact on image quality. The method is complementary to existing diffusion acceleration methods and shows robust gains on ImageNet 256×256 and GenEval benchmarks. Overall, DiSA provides a simple, practical approach to accelerating diffusion in autoregressive generators, with broad applicability and no retraining.

Abstract

An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon, adopt diffusion sampling to improve the quality of image generation. However, this strategy leads to low inference efficiency, because it usually takes 50 to 100 steps for diffusion to sample a token. This paper explores how to effectively address this issue. Our key motivation is that as more tokens are generated during the autoregressive process, subsequent tokens follow more constrained distributions and are easier to sample. Intuitively, if a model has generated part of a dog, the remaining tokens must complete the dog and thus are more constrained. Empirical evidence supports our motivation: at later generation stages, the next tokens can be well predicted by a multilayer perceptron, exhibit low variance, and follow closer-to-straight-line denoising paths from noise to tokens. Based on this finding, we introduce diffusion step annealing (DiSA), a training-free method that gradually uses fewer diffusion steps as more tokens are generated, e.g., using 50 steps at the beginning and gradually decreasing to 5 steps at later stages. Because DiSA is derived from our finding specific to diffusion in autoregressive models, it is complementary to existing acceleration methods designed for diffusion alone. DiSA can be implemented in only a few lines of code on existing models and, although simple, achieves $5$-$10\times$ faster inference for MAR and Harmon and $1.4$-$2.5\times$ for FlowAR and xAR, while maintaining the generation quality.
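To make the annealing idea concrete, the following is a minimal sketch of the three schedule families named above (two-stage, linear, cosine). It is not the authors' code: the function name `disa_steps`, the normalized `progress` variable, the `switch_frac` parameter for the two-stage variant, and the rounding rule are all assumptions made for illustration, and the paper's exact formulas may differ.

```python
import math


def disa_steps(k, num_ar_steps, t_early=50, t_late=5,
               schedule="linear", switch_frac=0.5):
    """Diffusion steps to use at autoregressive step k (0-indexed).

    Hypothetical helper: names, switch_frac, and rounding are assumptions,
    not taken from the paper.
    """
    if num_ar_steps < 1 or not 0 <= k < num_ar_steps:
        raise ValueError("k must satisfy 0 <= k < num_ar_steps")
    # Fraction of the autoregressive process completed, in [0, 1].
    progress = k / max(num_ar_steps - 1, 1)
    if schedule == "two_stage":
        # Use t_early before the switch point, t_late afterwards.
        steps = t_early if progress < switch_frac else t_late
    elif schedule == "linear":
        # Straight-line interpolation from t_early down to t_late.
        steps = t_early + (t_late - t_early) * progress
    elif schedule == "cosine":
        # Half-cosine decay from t_early down to t_late.
        steps = t_late + 0.5 * (t_early - t_late) * (1 + math.cos(math.pi * progress))
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    return max(t_late, int(round(steps)))


if __name__ == "__main__":
    n = 64  # number of autoregressive steps (example value)
    for name in ("two_stage", "linear", "cosine"):
        plan = [disa_steps(k, n, schedule=name) for k in range(n)]
        # Ratio of total diffusion steps versus a constant t_early=50 baseline.
        print(name, plan[0], plan[-1], sum(plan) / (50 * n))
```

The printed ratio counts diffusion steps only. It is not a wall-clock speedup, since the cost of the autoregressive backbone and the number of tokens sampled per step are ignored here.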

Paper Structure

This paper contains 11 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Overview. Architecture of four "autoregressive + diffusion" models included in this study: (a) MAR; (b) FlowAR; (c) xAR; (d) Harmon. (e) This paper improves the efficiency of these models by reducing diffusion steps without compromising generation quality.
  • Figure 2: Image prediction results at different stages of generation. In each image pair, the left image shows the currently generated tokens, while the right shows the final image we predict based on the generated tokens. The prediction results are inaccurate and lack details in early stages but become increasingly accurate as more tokens are generated. This is consistent across the four models.
  • Figure 3: Diffusion processes in later generation stages show (a-b) lower variance and (c) closer-to-straight-line denoising paths. (a) Two examples. In each example, the autoregressive step increases from top to bottom rows. 0%, 10%, 20% of tokens have been generated, respectively, as shown in the first column. We observe that the variance of sampled images drops from top to bottom rows. (b) Variance of diffusion-sampled tokens decreases along the autoregressive steps. The y-axis uses a logarithmic scale and each line represents a different token dimension. (c) Straightness of denoising paths increases from early to late stages. All results are obtained from the MAR-B model.
  • Figure 4: Impact of different numbers of diffusion steps in early generation stages $T_{early}$ and in late stages $T_{late}$ on (a) MAR-B; (b) MAR-L. In the first and third columns, we fix $T_{late}=50$ and reduce $T_{early}$, which significantly degrades generation quality. But as shown in the second and fourth columns, if we fix $T_{early}=50$ and decrease $T_{late}$, the degradation in generation quality is marginal.
  • Figure 5: Speed-quality trade-off for (a) MAR-B with {16, 32, 64, 128} autoregressive steps; (b) MAR-B with {25, 50, 100} diffusion steps; (c) MAR-L with {16, 32, 64, 128} autoregressive steps; (d) MAR-L with {25, 50, 100} diffusion steps; (e) FlowAR-L with {8, 10, 15, 20, 25} flow matching steps; (f) xAR-B and (g) xAR-L with {15, 20, 25, 40, 50} flow matching steps; and (h) Harmon-1.5B with different autoregressive and diffusion steps.
  • ...and 1 more figures