Simple Drop-in LoRA Conditioning on Attention Layers Will Improve Your Diffusion Model

Joo Young Choi, Jaesung R. Park, Inkyu Park, Jaewoong Cho, Albert No, Ernest K. Ryu

TL;DR

This work shows that simply adding LoRA conditioning to the attention layers without changing or tuning the other parts of the U-Net architecture improves the image generation quality.

Abstract

Current state-of-the-art diffusion models employ U-Net architectures containing convolutional and (qkv) self-attention layers. The U-Net processes images while being conditioned on the time embedding input for each sampling step and on the class or caption embedding input corresponding to the desired conditional generation. Such conditioning applies scale-and-shift operations to the convolutional layers but does not directly affect the attention layers. While these standard architectural choices are certainly effective, not conditioning the attention layers feels arbitrary and potentially suboptimal. In this work, we show that simply adding LoRA conditioning to the attention layers, without changing or tuning the other parts of the U-Net architecture, improves the image generation quality. For example, a drop-in addition of LoRA conditioning to the EDM diffusion model yields FID scores of 1.91/1.75 for unconditional and class-conditional CIFAR-10 generation, improving upon the baseline of 1.97/1.79.
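To make the idea of conditioning an attention projection through a low-rank update concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the effective weight takes the form $W + \sum_m \omega_m(c)\, B_m A_m$, where the mixing weights $\omega(c)$ come from a linear map of a condition embedding $c$. The class name `LoRAConditionedLinear`, the projection `P`, and all shapes and initializations are hypothetical choices made for illustration.

```python
import numpy as np


class LoRAConditionedLinear:
    """Illustrative sketch: a linear projection (e.g. a q/k/v map) whose
    low-rank update is mixed according to a condition embedding.
    All names and initialization choices here are assumptions."""

    def __init__(self, d_in, d_out, rank, num_bases, d_cond, seed=0):
        rng = np.random.default_rng(seed)
        # Base weight, standing in for the existing attention projection.
        self.W = rng.normal(0.0, d_in ** -0.5, size=(d_out, d_in))
        # Low-rank factors: A_m maps down to `rank`, B_m maps back up.
        self.A = rng.normal(0.0, d_in ** -0.5, size=(num_bases, rank, d_in))
        # Zero-initialized B_m, so the layer starts out equal to the base layer.
        self.B = np.zeros((num_bases, d_out, rank))
        # Hypothetical linear map from the condition embedding to mixing weights.
        self.P = rng.normal(0.0, d_cond ** -0.5, size=(num_bases, d_cond))

    def weights(self, cond):
        """Mixing weights omega(cond), shape (num_bases,)."""
        return self.P @ cond

    def __call__(self, x, cond):
        """x: (n_tokens, d_in), cond: (d_cond,) -> (n_tokens, d_out)."""
        omega = self.weights(cond)
        base = x @ self.W.T
        down = np.einsum("mrd,nd->mnr", self.A, x)      # A_m x
        up = np.einsum("mor,mnr->mno", self.B, down)    # B_m A_m x
        return base + np.einsum("m,mno->no", omega, up)  # weighted sum over m


layer = LoRAConditionedLinear(d_in=8, d_out=6, rank=2, num_bases=3, d_cond=4)
tokens = np.ones((5, 8))
cond = np.arange(4.0)
out = layer(tokens, cond)  # shape (5, 6)
```

Because the `B` factors start at zero, the layer initially reproduces the base projection exactly. This is one way to make such an addition "drop-in", though the paper's actual initialization may differ.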

Paper Structure (62 sections, 16 equations, 20 figures, 3 tables)

Figures (20)

  • Figure 1: The standard U-Net architecture for diffusion models conditions convolutional layers in residual blocks with scale-and-shift but does not condition attention blocks. Simply adding LoRA conditioning on attention layers improves the image generation quality.
  • Figure 2: Conditioning of U-Net Block: (left) scale-and-shift conditioning on the convolutional block; (middle) LoRA conditioning on the attention block; (right, top) TimeLoRA and ClassLoRA for the discrete-time setting; (right, bottom) unified composition LoRA for the continuous-SNR setting.
  • Figure 3: MNIST samples generated by nano diffusion trained with (1st row) conventional scale-and-shift conditioning; (2nd row) TimeLoRA with linear interpolation initialization; (3rd row) UC-LoRA; and (4th row) TimeLoRA with random initialization.
  • Figure 4: Cosine similarity between $\omega(t_1)$ and $\omega(t_2)$ for UC-LoRA applied to nano diffusion (left) at initialization and (right) after training. At initialization, the cosine similarity has no discernible pattern. After training, however, it is close to 1 whenever $t_1\approx t_2$.
  • Figure 5: Results of (top) interpolation of class labels in class-conditional EDM with (row 1) ClassLoRA and (row 2) scale-and-shift; (bottom) extrapolation of class labels in class-conditional EDM with (row 1) ClassLoRA and (row 2) scale-and-shift.
  • ...and 15 more figures
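
The pairwise cosine similarity described in the Figure 4 caption can be computed as in the short sketch below. It assumes the weight vectors $\omega(t_i)$ are stacked as rows of an array; the function name `cosine_similarity_matrix` and the random stand-in data are hypothetical and are not taken from the paper.

```python
import numpy as np


def cosine_similarity_matrix(omega):
    """omega: (T, M) array whose i-th row is omega(t_i).
    Returns the (T, T) matrix of pairwise cosine similarities."""
    norms = np.linalg.norm(omega, axis=1, keepdims=True)
    unit = omega / np.clip(norms, 1e-12, None)  # guard against zero rows
    return unit @ unit.T


# Stand-in data only: random vectors in place of real omega(t) values.
rng = np.random.default_rng(0)
random_omega = rng.normal(size=(50, 16))
sim = cosine_similarity_matrix(random_omega)  # shape (50, 50)
```

Entry `(i, j)` of `sim` compares $\omega(t_i)$ with $\omega(t_j)$, so plotting the matrix as a heatmap gives the kind of view the caption describes.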