On Inductive Biases That Enable Generalization of Diffusion Transformers

Jie An, De Wang, Pengsheng Guo, Jiebo Luo, Alexander Schwing

TL;DR

Experimental results on the CelebA, ImageNet, and LSUN datasets show that strengthening the inductive bias of a DiT can improve both generalization and generation quality when less training data is available.

Abstract

Recent work studying the generalization of diffusion models with UNet-based denoisers reveals inductive biases that can be expressed via geometry-adaptive harmonic bases. However, in practice, more recent denoising networks are often based on transformers, e.g., the diffusion transformer (DiT). This raises the question: do transformer-based denoising networks exhibit inductive biases that can also be expressed via geometry-adaptive harmonic bases? To our surprise, we find that this is not the case. This discrepancy motivates our search for the inductive bias that can lead to good generalization in DiT models. Investigating the pivotal attention modules of a DiT, we find that the locality of attention maps is closely associated with generalization. To verify this finding, we modify the generalization of a DiT by restricting its attention windows. We inject local attention windows into a DiT and observe an improvement in generalization. Furthermore, we empirically find that both the placement and the effective attention size of these local attention windows are crucial factors. Experimental results on the CelebA, ImageNet, and LSUN datasets show that strengthening the inductive bias of a DiT can improve both generalization and generation quality when less training data is available. Source code will be released publicly upon paper publication. Project page: dit-generalization.github.io/.
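To make the idea of restricting attention windows concrete, the following is a minimal sketch of a local attention mask over an $H{\times}W$ token grid. It is not the authors' implementation: the function names `local_attention_mask` and `masked_softmax`, the square window shape, and the clipping at image borders are all assumptions made for illustration.

```python
import math


def local_attention_mask(height, width, window):
    """Return an (H*W) x (H*W) boolean mask; True means attention is allowed.

    Assumption: each query token may attend only to keys inside a square
    window of odd side length `window` centred on it, clipped at the image
    border. The window layout used in the paper may differ.
    """
    radius = window // 2
    n = height * width
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        qi, qj = divmod(q, width)  # query row and column on the token grid
        for k in range(n):
            ki, kj = divmod(k, width)  # key row and column
            mask[q][k] = abs(qi - ki) <= radius and abs(qj - kj) <= radius
    return mask


def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores, with masked keys set to 0."""
    peak = max(s for s, a in zip(scores, allowed) if a)  # numerical stability
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# A 4x4 token grid with a 3x3 window around each query token.
mask = local_attention_mask(height=4, width=4, window=3)
# Attention weights for the token at row 1, column 1, given uniform scores.
row = masked_softmax([0.0] * 16, mask[5])
```

With uniform scores, the interior token spreads its attention evenly over the nine tokens in its window and assigns zero weight elsewhere, while a corner token sees only four keys because the window is clipped.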

Paper Structure

This paper contains 23 sections, 7 equations, 19 figures, 12 tables.

Figures (19)

  • Figure 1: Jacobian eigenvectors of (a) a simplified one-channel UNet, (b) the UNet introduced in improved diffusion nichol2021improved, and (c) a DiT peebles2023scalable. kadkhodaie2023generalization find that the generalization of a UNet-based diffusion model is driven by geometry-adaptive harmonic bases (a), which display oscillatory patterns whose frequency increases as the eigenvalue $\lambda_k$ decreases. We observe similar harmonic bases in split-channel eigenvectors (b) with standard UNets nichol2021improved. However, a DiT peebles2023scalable does not exhibit such harmonic bases (c), motivating our investigation to find the inductive bias that enables generalization in a DiT. The RGB channels of the split-channel eigenvectors are outlined with red, green, blue boxes, respectively. All models operate directly in the pixel space without applying the patchify operation.
  • Figure 2: The PSNR (a) and PSNR gap (b) comparisons between a UNet and a DiT with the same FLOPs for different training image quantities ($N$). When $N{=}10^5$, both DiT and UNet show small PSNR gaps between the training and testing sets. Nevertheless, when $N{=}10^3$ and $N{=}10^4$, a DiT exhibits smaller PSNR gaps compared to a UNet, indicating a better generalization ability under insufficient training data. All PSNR and PSNR gap curves are averaged over three models trained on different dataset shuffles. The standard deviations, illustrated by the curve shadows in the zoomed-in windows, are negligible, indicating minimal variation.
  • Figure 3: Jacobian eigenvector comparison between UNet nichol2021improved and DiT peebles2023scalable with equivalent FLOPs. (a) The eigenvectors of a UNet tend to memorize the training images when $N{=}10$ and drive the generalization through harmonic bases kadkhodaie2023generalization when $N{=}10^5$. In contrast, (b) the DiT’s eigenvectors exhibit neither the memorization effect at $N{=}10$ nor harmonic bases at $N{=}10^5$.
  • Figure 4: Attention maps of DiTs trained with $10$, $10^3$, and $10^5$ images. All attention maps are linearly normalized to the range $\left[0, 1\right]$, with a colormap applied to the interval $\left[0, 0.1\right]$ for enhanced visualization. The top-right insets provide a zoomed-in view of the center patch of each attention map. As the number of training images increases, DiT’s generalization improves, and attention maps across all layers exhibit stronger locality. The pink boxes highlight the attention corresponding to a specific output token, obtained by reshaping a single row from the layer-$12$ attention map (original shape: $1{\times}(HW)$) into a matrix of shape $H{\times}W$. As $N$ increases from $10$ to $10^5$, the token attentions progressively concentrate around the region near the output token (highlighted with blue boxes).
  • Figure 5: Global and local attention maps: (a) global attention captures the relationship between the target token and any input token, whereas (b) local attention focuses only on tokens within a nearby window around the target.
  • ...and 14 more figures
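
The PSNR gap in Figure 2 can be sketched as follows. This is a hedged illustration rather than the paper's evaluation code: the helper names `psnr` and `psnr_gap` are hypothetical, pixel values are assumed to lie in $[0, 1]$, images are assumed to be flattened into lists, and the gap is assumed to be the mean training PSNR minus the mean testing PSNR.

```python
import math


def psnr(clean, denoised, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((c - d) ** 2 for c, d in zip(clean, denoised)) / len(clean)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def psnr_gap(train_pairs, test_pairs):
    """Mean train PSNR minus mean test PSNR (assumed definition of the gap).

    Each pair is (clean_image, denoised_image). A large positive gap means
    the denoiser does much better on images it was trained on, which
    suggests memorization rather than generalization.
    """
    train = sum(psnr(c, d) for c, d in train_pairs) / len(train_pairs)
    test = sum(psnr(c, d) for c, d in test_pairs) / len(test_pairs)
    return train - test


# Toy example: the denoiser is closer to the clean image on the training set.
train_pairs = [([0.0, 0.0], [0.01, 0.01])]
test_pairs = [([0.0, 0.0], [0.1, 0.1])]
gap = psnr_gap(train_pairs, test_pairs)
```

In this toy example the training error is ten times smaller in amplitude than the testing error, so the gap is about 20 dB.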