SparseDM: Toward Sparse Efficient Diffusion Models

Kafeng Wang, Jianfei Chen, He Li, Zhenpeng Mi, Jun Zhu

TL;DR

SparseDM tackles the slow inference of diffusion models on resource-constrained devices by introducing 2:4 structured sparsity in Conv and Linear layers and training the sparse networks via transfer learning with an enhanced Straight-Through Estimator (STE). By fixing the sparsity pattern during training and transferring knowledge from a dense model, SparseDM preserves sample quality (FID) while halving MACs and achieving about 1.2x GPU acceleration. The approach combines 2:4 sparse masks, STE-based sparse training with a regularization term, and GPU-accelerated sparse operators to enable efficient diffusion-based generation on Transformer and UNet backbones. The work demonstrates practical acceleration on widely used diffusion architectures with modest trade-offs in quality, and outlines directions for processor-specific optimizations and broader applicability.
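To make the 2:4 pattern concrete, here is a minimal NumPy sketch that keeps the two largest-magnitude weights in every group of four consecutive weights. The function name `mask_2_4`, the choice of magnitude as the selection rule, and grouping along the last axis are assumptions made for illustration; the summary above does not state how SparseDM chooses its masks.

```python
import numpy as np


def mask_2_4(weight: np.ndarray) -> np.ndarray:
    """Return a 0/1 mask that keeps 2 of every 4 consecutive weights.

    Groups are formed along the last axis, so its length must be divisible
    by 4. Selection by magnitude is an illustrative assumption.
    """
    if weight.shape[-1] % 4 != 0:
        raise ValueError("last dimension must be divisible by 4")
    groups = np.abs(weight).reshape(-1, 4)
    # argsort is ascending, so columns 2 and 3 hold the two largest entries.
    keep = np.argsort(groups, axis=1)[:, 2:]
    mask = np.zeros_like(groups)
    np.put_along_axis(mask, keep, 1.0, axis=1)
    return mask.reshape(weight.shape)


rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))   # toy Linear-layer weight, (out, in)
m = mask_2_4(w)
w_sparse = w * m               # exactly half of the entries are zeroed
```

Because every group of four retains exactly two nonzeros, half of the multiplications in the layer can be skipped, which is where the 50% MAC reduction comes from.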

Abstract

Diffusion models represent a powerful family of generative models widely used for image and video generation. However, time-consuming deployment, long inference time, and large memory requirements hinder their application on resource-constrained devices. In this paper, we propose a method based on an improved Straight-Through Estimator (STE) to improve the deployment efficiency of diffusion models. Specifically, we add sparse masks to the Convolution and Linear layers in a pre-trained diffusion model, then train the sparse model by transfer learning during the fine-tuning stage, and turn on the sparse masks during inference. Experimental results on Transformer- and UNet-based diffusion models demonstrate that our method reduces MACs by 50% while maintaining FID. Sparse models are accelerated by approximately 1.2x on the GPU. Under other MACs conditions, the FID is also lower than 1 compared to other methods.
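The mask-then-fine-tune recipe described in the abstract can be sketched on a toy linear layer. Everything specific here is an assumption for illustration: the helper names `mask_2_4` and `ste_step`, the mean-squared-error loss standing in for the diffusion training objective, and the form of the regularizer (a decay on the currently pruned weights, in the style of SR-STE). The abstract only says that an improved STE is used, so this is not the authors' exact update rule.

```python
import numpy as np


def mask_2_4(weight: np.ndarray) -> np.ndarray:
    """0/1 mask keeping the 2 largest-magnitude weights per group of 4."""
    groups = np.abs(weight).reshape(-1, 4)
    keep = np.argsort(groups, axis=1)[:, 2:]
    mask = np.zeros_like(groups)
    np.put_along_axis(mask, keep, 1.0, axis=1)
    return mask.reshape(weight.shape)


def ste_step(w_dense, x, y, lr=0.05, lam=1e-3):
    """One hypothetical STE update for y_hat = x @ (w_dense * mask).T.

    Forward: only the masked (sparse) weights are used.
    Backward: the gradient w.r.t. the sparse weights is applied to the
    dense weights as if masking were the identity (straight-through).
    The lam term shrinks currently pruned weights (assumed regularizer).
    """
    m = mask_2_4(w_dense)
    w_sparse = w_dense * m
    err = x @ w_sparse.T - y                  # (batch, out)
    grad_sparse = 2.0 * err.T @ x / x.shape[0]  # d(mean sq. error)/d w_sparse
    grad_dense = grad_sparse + lam * (1.0 - m) * w_dense
    return w_dense - lr * grad_dense


# Toy fine-tuning loop: start from "pre-trained" dense weights.
rng = np.random.default_rng(0)
w_pretrained = rng.normal(size=(4, 8))
x = rng.normal(size=(64, 8))
y = x @ w_pretrained.T                        # targets from the dense model
w = w_pretrained.copy()
for _ in range(50):
    w = ste_step(w, x, y)
w_inference = w * mask_2_4(w)                 # masks stay on at inference
```

The dense weights are kept and updated throughout fine-tuning so that the mask can be recomputed from them; only the masked product is deployed at inference time.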

Paper Structure

This paper contains 20 sections, 5 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Framework Overview. This includes the process of transfer learning sparse models.
  • Figure 2: Image generation results of 2:4 sparse U-ViT: selected samples on MS-COCO 256$\times$256, ImageNet 256$\times$256, CIFAR10 32$\times$32, and CelebA 64$\times$64.
  • Figure 3: The comparison of sparsity results.
  • Figure 4: Adding a sparse mask to each layer.
  • Figure 5: Dense and sparse matrix on GPU.