
Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models and Time-Dependent Layer Normalization

Qihao Liu, Zhanpeng Zeng, Ju He, Qihang Yu, Xiaohui Shen, Liang-Chieh Chen

TL;DR

This work tackles image distortion in diffusion-based generation by introducing DiMR, which combines a Multi-Resolution Network with Time-Dependent Layer Normalization (TD-LN). The network uses a feature cascade across resolutions, with Transformer blocks at the lowest resolution and ConvNeXt blocks at higher resolutions, and relies on the parameter-efficient TD-LN to inject time information. Empirically, DiMR achieves state-of-the-art FID scores on class-conditional ImageNet (1.70 at 256×256 and 2.89 at 512×512), with fewer distorted images and favorable efficiency relative to prior Transformer-based diffusion models. The results suggest that multi-resolution denoising and lightweight time conditioning can substantially improve high-fidelity generation, offering a scalable path for future diffusion-model architectures.

Abstract

This paper presents innovative enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization. Diffusion models have gained prominence for their effectiveness in high-fidelity image generation. While conventional approaches rely on convolutional U-Net architectures, recent Transformer-based designs have demonstrated superior performance and scalability. However, Transformer architectures, which tokenize input data (via "patchification"), face a trade-off between visual fidelity and computational complexity because the cost of self-attention grows quadratically with token length. Larger patch sizes make attention computation more efficient, but they struggle to capture fine-grained visual details, leading to image distortions. To address this challenge, we propose augmenting the Diffusion model with the Multi-Resolution network (DiMR), a framework that refines features across multiple resolutions, progressively enhancing detail from low to high resolution. Additionally, we introduce Time-Dependent Layer Normalization (TD-LN), a parameter-efficient approach that incorporates time-dependent parameters into layer normalization to inject time information and achieve superior performance. Our method's efficacy is demonstrated on the class-conditional ImageNet generation benchmark, where DiMR-XL variants outperform prior diffusion models, setting new state-of-the-art FID scores of 1.70 on ImageNet 256×256 and 2.89 on ImageNet 512×512. Project page: https://qihao067.github.io/projects/DiMR
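
The patch-size trade-off described in the abstract can be made concrete with a small calculation. The sketch below counts tokens and pairwise attention entries for a square feature grid at several patch sizes. The 32×32 grid is an illustrative assumption (it is not stated in the abstract), and the function name attention_cost is hypothetical.

```python
def attention_cost(grid_size: int, patch_size: int) -> tuple:
    """Return (num_tokens, attention_entries) for a square grid.

    grid_size  : side length of the feature grid being patchified
                 (assumed value, chosen only for illustration)
    patch_size : side length of each square patch
    """
    if grid_size % patch_size != 0:
        raise ValueError("patch_size must divide grid_size evenly")
    tokens_per_side = grid_size // patch_size
    num_tokens = tokens_per_side ** 2
    # Self-attention compares every token with every token,
    # so the attention matrix has num_tokens ** 2 entries.
    return num_tokens, num_tokens ** 2


if __name__ == "__main__":
    for patch in (1, 2, 4, 8):
        tokens, entries = attention_cost(32, patch)
        print(f"patch {patch}: {tokens} tokens, {entries} attention entries")
```

Under this assumption, doubling the patch size divides the token count by 4 and the attention matrix by 16, while each token has to summarize 4 times as many grid positions. That is the efficiency-versus-detail tension the abstract refers to.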

Paper Structure (23 sections, 7 equations, 18 figures, 8 tables)

This paper contains 23 sections, 7 equations, 18 figures, 8 tables.

Figures (18)

  • Figure 1: (Top) Randomly sampled $512\times512$ images generated by the proposed DiMR. (Bottom) Random samples of the low-visual-fidelity $256\times256$ images generated by DiMR and DiT (Peebles and Xie, 2023). To detect low-visual-fidelity images for both models, a classifier-based rejection model is employed (with the same rejection rate). DiMR generates images with higher fidelity and less distortion than DiT.
  • Figure 2: Model overview. We propose DiMR, which enhances Diffusion models with a Multi-Resolution Network. In the figure, we present the Multi-Resolution Network with three branches. The first branch processes the lowest resolution (4 times smaller than the input size) using powerful Transformer blocks, while the other two branches handle higher resolutions (2 times smaller than the input size and the same size as the input, respectively) using effective ConvNeXt blocks. The network employs a feature cascade framework, progressively upsampling lower-resolution features to higher resolutions to reduce distortion in image generation. The Transformer and ConvNeXt blocks are further enhanced by the proposed Time-Dependent Layer Normalization (TD-LN), detailed in Fig. 4.
  • Figure 3: Principal Component Analysis (PCA) of learned scale and shift parameters in adaLN-Zero (Peebles and Xie, 2023). We conduct PCA on the learned scale ($\gamma_1$, $\gamma_2$) and shift ($\beta_1$, $\beta_2$) parameters obtained from a parameter-heavy MLP in adaLN-Zero using a pre-trained DiT-XL/2 model (Peebles and Xie, 2023). The vertical axis represents the explained variance ratio of the corresponding Principal Components (PCs). Our observations reveal that the learned parameters can be largely explained by two principal components, suggesting the potential to approximate them by a simpler function.
  • Figure 4: Time conditioning mechanisms. (Left) adaLN-Zero (Peebles and Xie, 2023) learns scale and shift parameters ($\gamma_i$, $\beta_i$, $\alpha_i$, $i\in\{1,2\}$) using parameter-heavy MLPs. (Right) The proposed Time-Dependent Layer Normalization (TD-LN) formulates the LN statistics as functions of time ($\gamma(t)$, $\beta(t)$), making it parameter-efficient.
  • Figure 5: DiMR alleviates distortions and improves visual fidelity. In this figure, we randomly visualize the detected low-fidelity images, identified by a pretrained classifier, which are generated by the best models from the baselines and our DiMR. The first column reports both their FID-50K scores and the proportion of distorted images based on human evaluation. DiMR demonstrates better generation performance and lower distortion rates than the baselines.
  • ...and 13 more figures
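
The time conditioning contrasted in Figure 4 can be sketched in a few lines. The captions only state that TD-LN makes the layer-norm scale and shift functions of time, $\gamma(t)$ and $\beta(t)$. The specific form below is an assumption for illustration, motivated by the two-component observation in Figure 3: two learned parameter sets blended by a scalar gate s(t) = sigmoid(w·t + b). All names (time_dependent_layer_norm, gamma_a, gamma_b, beta_a, beta_b, w, b) are hypothetical and are not taken from the authors' code.

```python
import math
from typing import List, Sequence


def time_dependent_layer_norm(
    x: Sequence[float],
    t: float,
    gamma_a: Sequence[float],
    gamma_b: Sequence[float],
    beta_a: Sequence[float],
    beta_b: Sequence[float],
    w: float = 1.0,
    b: float = 0.0,
    eps: float = 1e-5,
) -> List[float]:
    """Layer-normalize one feature vector with time-dependent scale and shift.

    ASSUMPTION: gamma(t) and beta(t) are convex blends of two learned
    parameter sets, gated by s(t) = sigmoid(w * t + b). This is an
    illustrative guess, not a confirmed description of TD-LN.
    """
    # Scalar gate in (0, 1) computed from the timestep.
    s = 1.0 / (1.0 + math.exp(-(w * t + b)))

    # Standard layer-norm statistics over the feature dimension.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)

    out = []
    for v, ga, gb, ba, bb in zip(x, gamma_a, gamma_b, beta_a, beta_b):
        gamma_t = s * ga + (1.0 - s) * gb  # time-dependent scale
        beta_t = s * ba + (1.0 - s) * bb   # time-dependent shift
        out.append(gamma_t * (v - mean) * inv_std + beta_t)
    return out


if __name__ == "__main__":
    features = [1.0, 2.0, 3.0, 4.0]
    ones, zeros = [1.0] * 4, [0.0] * 4
    print(time_dependent_layer_norm(features, 0.3, ones, ones, zeros, zeros))
```

Under this assumption the time conditioning costs four vectors of the feature width plus two scalars per normalization layer, rather than an MLP that maps the time embedding to every scale and shift, which is the kind of saving the "parameter-efficient" label in the caption points to.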