The Surprising Effectiveness of Skip-Tuning in Diffusion Sampling

Jiajun Ma, Shuchen Xue, Tianyang Hu, Wenjia Wang, Zhaoqiang Liu, Zhenguo Li, Zhi-Ming Ma, Kenji Kawaguchi

TL;DR

The paper argues that UNet skip connections limit the complexity of the transformation a diffusion model can realize, which matters most when sampling with few steps. It introduces Skip-Tuning, a simple, training-free method that scales the skip connections from the down-sampling blocks, layer by layer. On ImageNet-64 with a pretrained EDM, Skip-Tuning reaches an FID of 1.75 with only 19 NFEs, breaking the limit of ODE samplers, and an FID of 1.57 with 39 NFEs, surpassing the best EDM-2 result (1.58). Through analyses of pixel- versus feature-space score matching, time-dependent skip schedules, and an MMD comparison of inverted noise, the authors find that Skip-Tuning reduces the score-matching loss in feature space (while increasing it in pixel space) and brings inverted noise closer to Gaussian noise. The method carries over to several architectures (EDM, LDM, UViT) and improves pretrained diffusion models without additional training.
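The MMD comparison mentioned above can be illustrated with a small, self-contained sketch. It computes a biased estimate of the squared maximum mean discrepancy between a Gaussian reference sample and two candidate samples. The RBF kernel, the bandwidth, the sample sizes, and the toy data are all assumptions made for illustration; the summary does not say which kernel or features the authors use, and the shifted sample only stands in for inverted noise that deviates from a Gaussian.

```python
import numpy as np

def rbf_kernel(x, y, bandwidth):
    """RBF kernel matrix between rows of x (n, d) and rows of y (m, d)."""
    sq_dist = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd2_biased(x, y, bandwidth=2.0):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return (k_xx + k_yy) - 2.0 * k_xy

rng = np.random.default_rng(0)
dim, n = 4, 500

# Reference sample from a standard Gaussian.
gaussian_ref = rng.standard_normal((n, dim))
# Toy stand-ins for inverted noise (hypothetical data, not from the paper):
close_to_gaussian = rng.standard_normal((n, dim))
shifted = rng.standard_normal((n, dim)) + 1.0

mmd_close = mmd2_biased(close_to_gaussian, gaussian_ref)
mmd_shifted = mmd2_biased(shifted, gaussian_ref)
print(f"MMD^2 (Gaussian-like sample): {mmd_close:.4f}")
print(f"MMD^2 (shifted sample):       {mmd_shifted:.4f}")
```

A smaller value means the sample is harder to distinguish from Gaussian noise under the chosen kernel, which is the sense in which inverted noise can be said to lie "closer" to a Gaussian.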

Abstract

With the incorporation of the UNet architecture, diffusion probabilistic models have become a dominant force in image generation tasks. One key design in UNet is the skip connections between the encoder and decoder blocks. Although skip connections have been shown to improve training stability and model performance, we reveal that such shortcuts can be a limiting factor for the complexity of the transformation. As the sampling steps decrease, the generation process and the role of the UNet get closer to the push-forward transformations from Gaussian distribution to the target, posing a challenge for the network's complexity. To address this challenge, we propose Skip-Tuning, a simple yet surprisingly effective training-free tuning method on the skip connections. Our method can achieve 100% FID improvement for pretrained EDM on ImageNet 64 with only 19 NFEs (1.75), breaking the limit of ODE samplers regardless of sampling steps. Surprisingly, the improvement persists when we increase the number of sampling steps and can even surpass the best result from EDM-2 (1.58) with only 39 NFEs (1.57). Comprehensive exploratory experiments are conducted to shed light on the surprising effectiveness. We observe that while Skip-Tuning increases the score-matching losses in the pixel space, the losses in the feature space are reduced, particularly at intermediate noise levels, which coincide with the most effective range accounting for image quality improvement.
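As a rough illustration of what tuning the skip connections means, the sketch below scales each encoder feature by a coefficient $\rho$ before a toy decoder block concatenates it with the decoder activation. Everything here is a simplified assumption: the vector-valued "features", the `decoder_block` and `run_decoder` helpers, and the random weights are hypothetical stand-ins for a real pretrained UNet, and using one shared value $\rho=0.78$ for all layers is only one possible choice (the value is borrowed from the Figure 5 caption). No weights are trained or changed; only the skip inputs are rescaled at inference time.

```python
import numpy as np

def decoder_block(h, skip, rho, weight):
    """Toy decoder block: concatenate the decoder activation with the
    rho-scaled encoder skip feature, then apply a fixed linear map + tanh."""
    scaled_skip = rho * skip  # the only change: rescale the skip input
    x = np.concatenate([h, scaled_skip])
    return np.tanh(weight @ x)

def run_decoder(h, skips, weights, rhos):
    """Run the decoder; skips are stored in encoder order and consumed
    last-in, first-out, as in a UNet."""
    for skip, weight, rho in zip(reversed(skips), weights, rhos):
        h = decoder_block(h, skip, rho, weight)
    return h

rng = np.random.default_rng(0)
channels, depth = 8, 3

# Hypothetical frozen "pretrained" weights and encoder features.
skips = [rng.standard_normal(channels) for _ in range(depth)]
weights = [
    rng.standard_normal((channels, 2 * channels)) / np.sqrt(2 * channels)
    for _ in range(depth)
]
h0 = rng.standard_normal(channels)

baseline = run_decoder(h0, skips, weights, rhos=[1.0] * depth)  # unmodified skips
tuned = run_decoder(h0, skips, weights, rhos=[0.78] * depth)    # scaled-down skips

print("output change (l2 norm):", np.linalg.norm(baseline - tuned))
```

Setting every coefficient to 1.0 recovers the original network, and setting it to 0.0 removes the skip contribution entirely, so $\rho$ interpolates between the two without any retraining.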

Paper Structure (23 sections, 14 equations, 20 figures, 15 tables)

Figures (20)

  • Figure 1: The UNet demonstration figure.
  • Figure 2: The layerwise down-sampling skip to up-sampling vectors $l_2$ norm proportion.
  • Figure 3: The gradient $l_2$ norm changes with skip coefficient $\rho$.
  • Figure 4: ODE UniPC sampling results of different skip coefficients and steps.
  • Figure 5: The left-hand side 64x64 figures are sampled from ODE 10 steps (FID: 3.64); the right-hand side figures are sampled from ODE 10 steps with Skip-Tuning $\rho=0.78$ (FID: 1.88).
  • ...and 15 more figures

Theorems & Definitions (4)

  • Definition 3.1: Skip-Tuning
  • Remark 3.2: Beyond existing architecture
  • Remark 4.1
  • Remark 3.1