Fast-ARDiff: An Entropy-informed Acceleration Framework for Continuous Space Autoregressive Generation

Zhen Zou, Xiaoxiao Ma, Jie Huang, Zichao Yu, Feng Zhao

TL;DR

This work targets the latency bottleneck of continuous-space AR+diffusion hybrids by diagnosing an entropy mismatch between small draft autoregressive models and the large target models that verify them. It introduces Fast-ARDiff, a unified framework featuring entropy-informed speculative decoding, two-stage diffusion distillation with initialization adaptation, and an end-to-end training/inference pipeline with dynamic loss weighting and entropy-based early stopping. Empirical results show state-of-the-art acceleration: a 4.3× lossless speedup for TransDiff on ImageNet 256×256 and a 3× speedup for NextStep-1 on text-conditioned generation, with ablations confirming the contribution of each component. The approach makes high-fidelity AR+diffusion systems more practical to deploy by tightly coupling semantic guidance with efficient diffusion synthesis.

Abstract

Autoregressive (AR)-diffusion hybrid paradigms combine AR's structured modeling with diffusion's photorealistic synthesis, yet suffer from high latency due to sequential AR generation and iterative denoising. In this work, we tackle this bottleneck and propose Fast-ARDiff, a unified AR-diffusion framework that jointly optimizes both components, accelerating AR speculative decoding while simultaneously facilitating faster diffusion decoding. Specifically: (1) The entropy-informed speculative strategy encourages the draft model to produce higher-entropy representations aligned with the target model's entropy characteristics, mitigating the entropy mismatch and high rejection rates caused by draft overconfidence. (2) For diffusion decoding, rather than treating it as an independent module, we integrate it into the same end-to-end framework using a dynamic scheduler that prioritizes AR optimization to guide the diffusion part in further steps. The diffusion part is optimized through a joint distillation framework combining trajectory and distribution matching, ensuring stable training and high-quality synthesis with extremely few steps. During inference, shallow feature entropy from the AR module is used to pre-filter low-entropy drafts, avoiding redundant computation and improving latency. Fast-ARDiff achieves state-of-the-art acceleration across diverse models: on ImageNet 256×256, TransDiff attains a 4.3× lossless speedup, and NextStep-1 achieves a 3× acceleration on text-conditioned generation. Code will be available at https://github.com/aSleepyTree/Fast-ARDiff.
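
The inference-time pre-filter mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the entropy of a draft is the Shannon entropy of its softmax-normalized shallow feature vector, and that drafts below a fixed threshold are dropped before the target model verifies them. The names `feature_entropy`, `prefilter_drafts`, and `threshold` are hypothetical, and the paper's actual entropy estimator and filtering rule may differ.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a 1-D feature vector."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def feature_entropy(features: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the softmax-normalized features.

    Assumption: this stands in for the paper's "shallow feature entropy";
    the estimator used by the authors is not specified here.
    """
    probs = softmax(features)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def prefilter_drafts(drafts: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Return the indices of drafts whose entropy is at least `threshold`.

    Low-entropy (overconfident) drafts are skipped, so the larger target
    model does not spend verification compute on them.
    """
    return [i for i, feats in enumerate(drafts) if feature_entropy(feats) >= threshold]


if __name__ == "__main__":
    flat_draft = [0.0, 0.0, 0.0, 0.0]      # uniform after softmax: high entropy
    peaked_draft = [10.0, 0.0, 0.0, 0.0]   # one dominant entry: low entropy
    kept = prefilter_drafts([flat_draft, peaked_draft], threshold=0.5)
    print("drafts kept for verification:", kept)
```

In this toy setting the uniform draft is kept and the sharply peaked one is filtered out. The threshold of 0.5 nats is arbitrary and would need tuning against the target model's entropy statistics.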

Paper Structure

This paper contains 19 sections, 10 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Qualitative generation comparison between the vanilla NextStep-1 [team2025nextstep] and our method on GenEval [ghosh2023geneval].
  • Figure 2: Entropy distribution comparison between vision and language models, and poor diversity of small draft models. (a) Entropy distribution of vision models (small vs. large AR models); (b) Entropy distribution of language models; (c) Insufficient diversity of outputs from small draft models, which generate similar images.
  • Figure 3: Illustration of Fast-ARDiff: an end-to-end unified framework enabling mutual perception between speculative decoding (AR branch) and diffusion distillation (CD/DMD branches). (Left) The entropy-informed loss alleviates entropy mismatch in the AR draft model for better speculative generation, while providing entropy-aligned semantic guidance to stabilize subsequent diffusion distillation. (Right) Two-stage diffusion distillation feeds back into the end-to-end pipeline, dynamically regulating AR’s feature learning to align with diffusion’s high-fidelity demands. This bidirectional interaction (entropy guidance refining diffusion initialization, and diffusion loss optimizing AR modeling) enables tight cross-module synergy via end-to-end optimization.
  • Figure 4: Comparison with NextStep-1 [team2025nextstep] on MJHQ-30K [li2024playground].
  • Figure 5: Comparison on ImageNet 256×256: TransDiff (Row 1), TransDiff+DMD (Row 2), TransDiff+Ours (Row 3).
  • ...and 4 more figures