Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching

Enshu Liu, Xuefei Ning, Yu Wang, Zinan Lin

TL;DR

This work tackles the slow generation of autoregressive (AR) models by presenting Distilled Decoding (DD), a flow matching–based distillation framework that converts Gaussian noise directly into AR-consistent outputs, enabling one- or two-step image generation for state-of-the-art AR models. DD constructs deterministic AR trajectories via flow matching and then distills these trajectories into a neural predictor that can generate full samples from noise with minimal dependence on the original training data. The method delivers dramatic speedups (e.g., $6.3\times$ for VAR and $217.8\times$ for LlamaGen in 1-step generation) with manageable fidelity losses, and extends to text-to-image tasks with substantial acceleration and controlled quality. By enabling flexible trade-offs between speed and quality, DD challenges the notion that autoregressive models must be inherently slow and opens pathways for efficient AR generation in vision and beyond.

Abstract

Autoregressive (AR) models have achieved state-of-the-art performance in text and image generation but suffer from slow generation due to the token-by-token process. We ask an ambitious question: can a pre-trained AR model be adapted to generate outputs in just one or two steps? If successful, this would significantly advance the development and deployment of AR models. We observe that existing works that speed up AR generation by generating multiple tokens at once fundamentally cannot capture the output distribution, because they ignore the conditional dependencies between tokens; this limits their effectiveness for few-step generation. To address this, we propose Distilled Decoding (DD), which uses flow matching to create a deterministic mapping from a Gaussian distribution to the output distribution of the pre-trained AR model. We then train a network to distill this mapping, enabling few-step generation. DD doesn't need the training data of the original AR model, making it more practical. We evaluate DD on state-of-the-art image AR models and present promising results on ImageNet-256. For VAR, which requires 10-step generation, DD enables one-step generation (6.3$\times$ speed-up), with an acceptable increase in FID from 4.19 to 9.96. For LlamaGen, DD reduces generation from 256 steps to 1, achieving a 217.8$\times$ speed-up with a comparable FID increase from 4.11 to 11.35. In both cases, baseline methods completely fail with FID > 100. DD also excels on text-to-image generation, reducing the generation from 256 steps to 2 for LlamaGen with a minimal FID increase from 25.70 to 28.95. As the first work to demonstrate the possibility of one-step generation for image AR models, DD challenges the prevailing notion that AR models are inherently slow, and opens up new opportunities for efficient AR generation. The project website is at https://imagination-research.github.io/distilled-decoding.

Paper Structure

This paper contains 25 sections, 1 theorem, 12 equations, 17 figures, 2 tables, 4 algorithms.

Key Result

Proposition 3.1

The optimal solution of the one-step objective (eq:onestep_obj) is $\hat{p}_{\theta^* jk}=\frac{1}{N}\sum_{i=1}^{N}p_{ijk}$.
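
The closed form in Proposition 3.1 can be checked numerically on a toy case. The sketch below is an illustration, not the paper's code: the objective eq:onestep_obj is not reproduced on this page, so it assumes a cross-entropy objective summed over samples, and the helper names (`cross_entropy`, `averaged_solution`) and all numbers are hypothetical. It also shows why an averaged, per-position prediction loses the dependencies between tokens.

```python
import numpy as np

def cross_entropy(p, q):
    """Assumed objective for one position j: -sum_i sum_k p[i, k] * log q[k]."""
    return float(-(p * np.log(q)).sum())

def averaged_solution(p):
    """Closed form from Proposition 3.1: mean of p[i, :] over the N samples."""
    return p.mean(axis=0)

# Hypothetical teacher targets for one position: N = 3 samples, 4 codebook entries.
p = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.2, 0.3, 0.3],
])
q_star = averaged_solution(p)

# Under the assumed objective, no other point on the simplex scores better.
rng = np.random.default_rng(0)
others = rng.dirichlet(np.ones(4), size=1000)
best_other = min(cross_entropy(p, q) for q in others)

# Toy two-token case: the data contains only the sequences (0, 0) and (1, 1).
# Each position's average is [0.5, 0.5], so sampling the two positions
# independently yields the unseen sequences (0, 1) or (1, 0) half of the time.
one_hot = np.array([[1.0, 0.0], [0.0, 1.0]])
marginal = averaged_solution(one_hot)
prob_unseen = marginal[0] * marginal[1] + marginal[1] * marginal[0]
```

In this toy case the averaged prediction is optimal for the assumed objective, yet half of the independently sampled sequences never occur in the data, which is the failure mode that motivates mapping noise to whole sequences instead.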

Figures (17)

  • Figure 1: Qualitative comparisons between DD and vanilla LlamaGen on ImageNet 256$\times$256. The images generated by DD show only a small quality loss compared to the pre-trained AR model, while achieving a $\geq$200$\times$ speedup. More examples are in the appendix (app:visualization).
  • Figure 2: Qualitative results of DD-2step on the text-to-image task. The model is distilled from a LlamaGen model with prompts from the LAION-COCO dataset. The speedup is around 93$\times$ compared to the teacher model. More examples are in the appendix (app:visualization).
  • Figure 3: Comparison of DD models, pre-trained models, and other acceleration methods for pre-trained models. DD achieves significant speedup compared to pre-trained models with comparable performance. In contrast, other methods' performance degrades quickly as inference time decreases.
  • Figure 4: High-level comparison between our Distilled Decoding (DD) and prior work. To generate a sequence of tokens $q_i$: (a) the vanilla AR model generates token-by-token, thus being slow; (b) parallel decoding generates multiple tokens in parallel (sec:decrease_sampling_ar), which fundamentally cannot match the generated distribution of the original AR model with one-step generation (see sec:non_trivial); (c) our DD maps noise tokens $\epsilon_i$ from a Gaussian distribution to the whole sequence of generated tokens directly in one step, and it is guaranteed that (in the optimal case) the distribution of generated tokens matches that of the original AR model.
  • Figure 5: AR flow matching. Given all previous tokens, the teacher AR model gives a probability vector for the next token, which defines a mixture of Dirac delta distributions over all tokens in the codebook. We then construct a deterministic mapping between the Gaussian distribution and the Dirac delta distribution with flow matching. The next noise token $\epsilon_4$ is sampled from the Gaussian distribution, and its corresponding token in the codebook becomes the next token $q_4$.
  • ...and 12 more figures
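
The AR flow matching step described in the Figure 5 caption can be sketched for a single token as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the linear interpolation path $x_t = (1-t)\epsilon + t c_k$ between noise $\epsilon$ and codebook entry $c_k$, a plain Euler solver, and nearest-neighbour snapping at the end. The function name `ar_flow_token`, its parameters, and the toy codebook are hypothetical.

```python
import numpy as np

def ar_flow_token(noise, probs, codebook, n_steps=64):
    """Map Gaussian noise to codebook indices by Euler-integrating a flow ODE.

    noise:    (n, d) samples from N(0, I)
    probs:    (K,) next-token probabilities from a (hypothetical) teacher
    codebook: (K, d) token embeddings
    """
    x = np.array(noise, dtype=np.float64)
    log_p = np.log(np.clip(probs, 1e-30, None))
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        # With x_t = (1 - t) * eps + t * c_k, x_t given c_k is N(t * c_k, (1 - t)^2 I),
        # so the posterior weight of entry k is a softmax of the logits below.
        sq_dist = ((x[:, None, :] - t * codebook[None, :, :]) ** 2).sum(-1)
        logits = log_p[None, :] - sq_dist / (2.0 * (1.0 - t) ** 2)
        logits -= logits.max(axis=1, keepdims=True)
        w = np.exp(logits)
        w /= w.sum(axis=1, keepdims=True)
        # Posterior-mean velocity: sum_k w_k * (c_k - x) / (1 - t).
        target = w @ codebook
        x += dt * (target - x) / (1.0 - t)
    # Snap the end point of the trajectory to the nearest codebook entry.
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

# Toy usage: a 1-D codebook with two entries and teacher probabilities 0.3 / 0.7.
rng = np.random.default_rng(0)
codebook = np.array([[-1.0], [1.0]])
probs = np.array([0.3, 0.7])
noise = rng.standard_normal((4000, 1))
tokens = ar_flow_token(noise, probs, codebook, n_steps=200)
```

Because the mapping is deterministic, the same noise always yields the same token, while the fraction of noise samples landing on each codebook entry should approximately follow `probs`. Repeating this step position by position, with the teacher conditioned on the tokens produced so far, gives the kind of noise-to-sequence trajectory that the caption describes.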

Theorems & Definitions (2)

  • Proposition 3.1
  • proof