Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis

Zanlin Ni, Yulin Wang, Renping Zhou, Jiayi Guo, Jinyi Hu, Zhiyuan Liu, Shiji Song, Yuan Yao, Gao Huang

TL;DR

The paper addresses the fidelity gap between diffusion models and the much faster non-autoregressive Transformers (NATs) in image synthesis by framing the design of NAT training and generation strategies as a unified hyperparameter optimization problem. It introduces AutoNAT, which uses an alternating optimization procedure to jointly optimize these strategies, including a Beta-distributed mask ratio and scheduling hyperparameters. On ImageNet-256, ImageNet-512, MS-COCO, and CC3M, AutoNAT performs comparably with recent diffusion models at a significantly lower inference cost, and its optimized strategies transfer well across model sizes. The results suggest that systematic optimization, rather than heuristic design, can bring out the full potential of NATs for fast image synthesis.
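
As a rough illustration of what a Beta-distributed mask ratio could look like during training, the sketch below draws a ratio from Beta(alpha, beta) and masks that fraction of a token sequence. The `alpha` and `beta` values, the `MASK_ID` sentinel, and the function names are hypothetical placeholders chosen for this example; they are not the optimized values or the code from the paper.

```python
import random

MASK_ID = -1  # hypothetical sentinel marking a masked token


def sample_mask_ratio(alpha, beta, rng):
    """Draw a mask ratio r in [0, 1] from Beta(alpha, beta)."""
    return rng.betavariate(alpha, beta)


def mask_tokens(tokens, ratio, rng):
    """Replace round(ratio * len(tokens)) random tokens (at least one) with MASK_ID."""
    n = len(tokens)
    num_masked = max(1, min(n, round(ratio * n)))
    masked_positions = set(rng.sample(range(n), num_masked))
    return [MASK_ID if i in masked_positions else t for i, t in enumerate(tokens)]


rng = random.Random(0)
tokens = list(range(100, 116))  # 16 dummy VQ token ids
ratio = sample_mask_ratio(alpha=4.0, beta=2.0, rng=rng)  # placeholder shape parameters
masked = mask_tokens(tokens, ratio, rng)
```

Treating the two Beta shape parameters as tunable quantities is what lets the mask-ratio distribution be searched over rather than fixed by hand.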

Abstract

The field of image synthesis is currently flourishing due to the advancements in diffusion models. While diffusion models have been successful, their computational intensity has prompted the pursuit of more efficient alternatives. As a representative work, non-autoregressive Transformers (NATs) have been recognized for their rapid generation. However, a major drawback of these models is their inferior performance compared to diffusion models. In this paper, we aim to re-evaluate the full potential of NATs by revisiting the design of their training and inference strategies. Specifically, we identify the complexities in properly configuring these strategies and indicate the possible sub-optimality in existing heuristic-driven designs. Recognizing this, we propose to go beyond existing methods by directly solving the optimal strategies in an automatic framework. The resulting method, named AutoNAT, advances the performance boundaries of NATs notably, and is able to perform comparably with the latest diffusion models at a significantly reduced inference cost. The effectiveness of AutoNAT is validated on four benchmark datasets, i.e., ImageNet-256 & 512, MS-COCO, and CC3M. Our code is available at https://github.com/LeapLabTHU/ImprovedNAT.

Paper Structure

This paper contains 29 sections, 6 equations, 5 figures, 6 tables, and 1 algorithm.

Figures (5)

  • Figure 1: FID-50K vs. computational cost on ImageNet-256. For fair comparison, diffusion models are equipped with DPM-Solver [lu2022dpm, lu2022dpmp] for efficient synthesis.
  • Figure 2: The generation process of non-autoregressive Transformers starts from an entirely masked canvas and decodes multiple tokens in parallel at each step. The generated tokens are then mapped to pixel space with a pre-trained VQ-decoder [esser2021taming].
  • Figure 3: The heuristic design of $p(r)$ in existing works: the density of $p(r)$ reflects the frequency of mask ratios encountered during generation. We take $T = 12$ as an example. Notably, as shown in Table \ref{tab:comp_fixed}, such a heuristic design is sub-optimal.
  • Figure 4: Sampling efficiency on ImageNet-256 and ImageNet-512. LDM is not included in the ImageNet-512 results as it is only trained on ImageNet-256. GPU time is measured on an A100 GPU with batch size 50. CPU time is measured on a Xeon 8358 CPU with batch size 1. $^\dagger$: diffusion models augmented with DPM-Solver [lu2022dpm].
  • Figure 5: Selected visualizations of AutoNAT. Samples are generated in 8 steps with AutoNAT-L on ImageNet-512 and ImageNet-256.
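
The generation process described in the Figure 2 caption can be sketched as a short loop: start from a fully masked canvas and commit several tokens in parallel at each step. This is a minimal sketch under stated assumptions, not the paper's implementation. `toy_predict` is a hypothetical stand-in for the Transformer, the cosine schedule and confidence-based token selection are common heuristics assumed here for illustration (AutoNAT instead optimizes such choices), and the VQ-decoder that maps tokens to pixels is omitted.

```python
import math
import random

MASK_ID = -1  # hypothetical sentinel marking a masked token


def toy_predict(canvas, vocab_size, rng):
    """Stand-in for the Transformer: one (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in canvas]


def generate(num_tokens, num_steps, vocab_size, rng):
    canvas = [MASK_ID] * num_tokens  # start from an entirely masked canvas
    for step in range(1, num_steps + 1):
        masked = [i for i, t in enumerate(canvas) if t == MASK_ID]
        if not masked:
            break
        # Assumed cosine schedule: number of tokens left masked after this step.
        keep_masked = max(0, math.floor(num_tokens * math.cos(math.pi / 2 * step / num_steps)))
        keep_masked = min(keep_masked, len(masked) - 1)  # decode at least one token
        preds = toy_predict(canvas, vocab_size, rng)
        # Commit the most confident masked positions in parallel.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: len(masked) - keep_masked]:
            canvas[i] = preds[i][0]
    return canvas


tokens = generate(num_tokens=256, num_steps=8, vocab_size=1024, rng=random.Random(0))
```

With 8 steps and 256 tokens, the whole canvas is filled after 8 model calls, which is where the speed advantage over one-token-at-a-time decoding comes from.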