Table of Contents
Fetching ...

Clockwork Diffusion: Efficient Generation With Model-Step Distillation

Amirhossein Habibian, Amir Ghodrati, Noor Fathima, Guillaume Sautiere, Risheek Garrepalli, Fatih Porikli, Jens Petersen

TL;DR

This work proposes Clockwork Diffusion, a method that periodically reuses computation from preceding denoising steps to approximate low-res feature maps at one or more subsequent steps and demonstrates that Clockwork leads to compa-rable or improved perceptual scores with drastically re-duced computational complexity.

Abstract

This work aims to improve the efficiency of text-to-image diffusion models. While diffusion models use computationally expensive UNet-based denoising operations in every generation step, we identify that not all operations are equally relevant for the final output quality. In particular, we observe that UNet layers operating on high-res feature maps are relatively sensitive to small perturbations. In contrast, low-res feature maps influence the semantic layout of the final image and can often be perturbed with no noticeable change in the output. Based on this observation, we propose Clockwork Diffusion, a method that periodically reuses computation from preceding denoising steps to approximate low-res feature maps at one or more subsequent steps. For multiple baselines, and for both text-to-image generation and image editing, we demonstrate that Clockwork leads to comparable or improved perceptual scores with drastically reduced computational complexity. As an example, for Stable Diffusion v1.5 with 8 DPM++ steps we save 32% of FLOPs with negligible FID and CLIP change.

Clockwork Diffusion: Efficient Generation With Model-Step Distillation

TL;DR

This work proposes Clockwork Diffusion, a method that periodically reuses computation from preceding denoising steps to approximate low-res feature maps at one or more subsequent steps and demonstrates that Clockwork leads to compa-rable or improved perceptual scores with drastically re-duced computational complexity.

Abstract

This work aims to improve the efficiency of text-to-image diffusion models. While diffusion models use computationally expensive UNet-based denoising operations in every generation step, we identify that not all operations are equally relevant for the final output quality. In particular, we observe that UNet layers operating on high-res feature maps are relatively sensitive to small perturbations. In contrast, low-res feature maps influence the semantic layout of the final image and can often be perturbed with no noticeable change in the output. Based on this observation, we propose Clockwork Diffusion, a method that periodically reuses computation from preceding denoising steps to approximate low-res feature maps at one or more subsequent steps. For multiple baselines, and for both text-to-image generation and image editing, we demonstrate that Clockwork leads to comparable or improved perceptual scores with drastically reduced computational complexity. As an example, for Stable Diffusion v1.5 with 8 DPM++ steps we save 32% of FLOPs with negligible FID and CLIP change.
Paper Structure (50 sections, 5 equations, 20 figures, 7 tables, 1 algorithm)

This paper contains 50 sections, 5 equations, 20 figures, 7 tables, 1 algorithm.

Figures (20)

  • Figure 1: Time savings with Clockwork, for different baselines. All pairs have roughly constant FID (computed on MS-COCO 2017 5K validation set), using 8 sampling steps (DPM++). Clockwork can be applied on top of standard models as well as heavily optimized ones. Timings computed on NVIDIA® RTX® 3080 at batch size 1 (for distilled model) or 2 (for classifier-free guidance). Prompt: "the bust of a man's head is next to a vase of flowers".
  • Figure 2: Perturbing Stable Diffusion v1.5 UNet representations (outputs of the three upsampling layers), starting from different sampling steps (20 DPM++ steps total, note the reference image as inset in lower-right). Perturbing low-resolution features after only a small number of steps has a comparatively small impact on the final output, whereas perturbation of higher-res features results in high-frequency artifacts. Prompt: "image of an astronaut riding a horse on mars."
  • Figure 3: Schematic view of Clockwork. It can be thought of as a combination of model distillation and step distillation. We replace the lower-resolution parts of the UNet $\bm{\epsilon}$ with a more lightweight adaptor, and at the same time give it access to features from the previous sampling step. Contrary to common step distillation, which constructs latents by forward noising images, we train with sampling trajectories unrolled from pure noise. Other modules are conditioned on text and time embeddings (omitted for readability). The gray panel illustrates the difference between regular distillation and our proposed training with unrolled trajectories.
  • Figure 4: Clockwork improves text-to-image generation efficiency consistently over various diffusion models. Models are evaluated on $512\times512$MS-COCO 2017-5K validation set.
  • Figure 5: Text guided generations by SD UNet without (top) and with (bottom) Clockwork at 8 sampling steps (DPM++). Clockwork reduces FLOPs by $32\%$ at a similar generation quality. Prompts given in \ref{['sec:appendix:t2i_examples']}.
  • ...and 15 more figures