Effective Diffusion Transformer Architecture for Image Super-Resolution

Kun Cheng, Lei Yu, Zhijun Tu, Xiao He, Liyu Chen, Yong Guo, Mingrui Zhu, Nannan Wang, Xinbo Gao, Jie Hu

TL;DR

This work designs an effective diffusion transformer for image super-resolution (DiT-SR) that matches the visual quality of prior-based methods while being trained from scratch, demonstrating the strength of diffusion transformers in image super-resolution.

Abstract

Recent advances indicate that diffusion models hold great promise in image super-resolution. While the latest methods are primarily based on latent diffusion models with convolutional neural networks, few attempts have explored transformers, which have demonstrated remarkable performance in image generation. In this work, we design an effective diffusion transformer for image super-resolution (DiT-SR) that achieves the visual quality of prior-based methods while being trained from scratch. In practice, DiT-SR uses an overall U-shaped architecture and adopts a uniform isotropic design for all the transformer blocks across different stages. The former facilitates multi-scale hierarchical feature extraction, while the latter reallocates computational resources to critical layers to further enhance performance. Moreover, we thoroughly analyze the limitations of the widely used AdaLN and present a frequency-adaptive time-step conditioning module, which enhances the model's capacity to process distinct frequency information at different time steps. Extensive experiments demonstrate that DiT-SR significantly outperforms existing training-from-scratch diffusion-based SR methods, and even beats some of the prior-based methods built on pretrained Stable Diffusion, demonstrating the strength of diffusion transformers in image super-resolution.
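To make the effect of the isotropic design concrete, the following is a rough back-of-the-envelope sketch, not the authors' accounting. It assumes four stages with 2x spatial downsampling between them, counts only the linear-layer cost of a transformer block (about 12·C² multiply-adds per token, ignoring the attention map itself), and uses hypothetical channel widths (64 to 512 for a channel-doubling U-shape, 160 everywhere for the isotropic variant).

```python
def stage_flops_share(channels, base_tokens=64 * 64, blocks_per_stage=1):
    """Return each stage's share of linear-layer FLOPs in a U-shaped encoder.

    Assumptions (illustrative only): every stage halves the spatial
    resolution (4x fewer tokens), and a block costs ~12 * C^2 multiply-adds
    per token (4 * C^2 for attention projections, 8 * C^2 for a 4x MLP).
    """
    flops = []
    for stage, c in enumerate(channels):
        tokens = base_tokens // (4 ** stage)
        flops.append(blocks_per_stage * tokens * 12 * c * c)
    total = sum(flops)
    return [f / total for f in flops]


# Channel-doubling U-shape: fewer tokens are offset by wider channels.
pyramid = stage_flops_share([64, 128, 256, 512])

# Isotropic design: same width everywhere, so cost follows the token count.
isotropic = stage_flops_share([160, 160, 160, 160])

print("pyramid  :", [round(s, 3) for s in pyramid])
print("isotropic:", [round(s, 3) for s in isotropic])
```

Under these assumptions the channel-doubling layout spreads linear-layer compute evenly across stages, while the isotropic layout concentrates most of it in the highest-resolution stage, which is the reallocation the abstract describes.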

Paper Structure

This paper contains 33 sections, 7 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Comparison between the proposed method and the latest SR methods on the RealSR dataset. Top: CLIPIQA vs. parameters. Bottom: CLIPIQA vs. FLOPs. "Diff-based SR" refers to diffusion-based image super-resolution methods trained from scratch.
  • Figure 2: Analysis of images generated at different stages by a diffusion-based super-resolution model (yue2024resshift). The first row shows the predicted clean images at various steps, while the second row displays the Fourier spectrum of each predicted clean image. The diffusion model first generates low-frequency components (the center of each spectrum) and subsequently generates high-frequency components (the periphery of each spectrum).
  • Figure 3: Comparison of the standard DiT and the proposed DiT-SR. (a): The standard DiT. (b): U-shaped DiT, which adds downsampling and upsampling to the standard DiT and increases the channel dimension in deep layers. (c): The proposed DiT-SR. This architecture employs a U-shaped global structure, yet maintains the same channel dimension for all transformer blocks in different stages, allocating computational resources to high-resolution layers ($4C_2 > C_3 > C_2$) to boost model capacity.
  • Figure 4: The percentages of FLOPs and parameters for each stage of the U-shaped DiT, with and without the isotropic design, showing that the isotropic design allocates more computational resources to high-resolution stages.
  • Figure 5: Illustration of the transformer block in DiT-SR and Adaptive Frequency Modulation (AdaFM). AdaFM injects the time step into the frequency domain and adaptively reweights different frequency components.
  • ...and 5 more figures
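
The idea behind Figures 2 and 5, reweighting frequency components as a function of the time step, can be illustrated with a minimal NumPy sketch. This is not the authors' AdaFM: the function names, the hand-written radial gain (standing in for a learned, time-step-conditioned weight), and the convention that a larger `t` means an earlier, noisier sampling step are all assumptions made for illustration.

```python
import numpy as np


def radial_frequency(h, w):
    """Normalized distance from the DC component for each FFT bin, in [0, 1]."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    r = np.sqrt(fx ** 2 + fy ** 2)
    return r / r.max()


def frequency_modulate(feat, t, num_steps):
    """Reweight the spectrum of a (C, H, W) feature map based on time step t.

    Assumption: t = num_steps - 1 is the first (noisiest) sampling step and
    t = 0 the last. The gain is a fixed stand-in for a learned weight: it
    leaves the spectrum untouched early on and boosts high frequencies
    progressively as sampling approaches t = 0.
    """
    _, h, w = feat.shape
    progress = 1.0 - t / (num_steps - 1)      # 0 at the start, 1 at the end
    gain = 1.0 + progress * radial_frequency(h, w)
    spec = np.fft.fft2(feat, axes=(-2, -1))
    out = np.fft.ifft2(spec * gain, axes=(-2, -1))
    return out.real                            # gain is symmetric, so the result is real


rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 16, 16))
early = frequency_modulate(feat, t=14, num_steps=15)  # gain == 1 everywhere
late = frequency_modulate(feat, t=0, num_steps=15)    # high frequencies boosted
```

In a trained model the gain would be predicted from the time-step embedding rather than fixed by hand; the sketch only shows where such a weight would act.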