
Data-Free Group-Wise Fully Quantized Winograd Convolution via Learnable Scales

Shuokai Pan, Gerti Tuzi, Sudarshan Sreeram, Dibakar Gope

TL;DR

This work tackles the high compute and memory demands of large diffusion models by enabling efficient on-device inference through data-free, group-wise quantization of Winograd convolutions. It introduces learnable diagonal scales for the Winograd transform matrices, with S_B and S_G optimized via random Gaussian noise and S_A = (S_B S_G)^{-1}, to mitigate large dynamic-range differences in the Winograd domain without requiring calibration data. The approach preserves image generation quality in 8-bit quantization (near FP16 in FID/CLIP) and improves ImageNet top-1 accuracy on ResNet-18/34 with Winograd F(6, 3), while delivering substantial CPU runtime gains (~31.3% faster convolutions and ~12.8% faster end-to-end diffusion) on Arm CPUs. Together, these contributions enable practical, on-device diffusion model inference with strong generalization and broad applicability across datasets and tasks.
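
The scale relation S_A = (S_B S_G)^{-1} can be checked numerically. The sketch below is a minimal NumPy illustration, not the authors' implementation. It assumes the diagonal scales multiply each transformed tile on both sides, and it uses the small, standard F(2, 3) transform matrices for brevity rather than F(6, 3). The scale values are arbitrary positive numbers standing in for learned ones, and the names `winograd_tile` and `direct_corr` are hypothetical.

```python
import numpy as np

# Standard Winograd F(2, 3) transform matrices (used here only for brevity).
BT = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=np.float64)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=np.float64)


def direct_corr(d, g):
    """Reference 2x2 output of a 3x3 filter slid over a 4x4 input tile."""
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(d[i:i + 3, j:j + 3] * g)
    return out


def winograd_tile(d, g, s_b=None, s_g=None):
    """Winograd tile with optional diagonal scales (assumed two-sided placement)."""
    n = BT.shape[0]
    s_b = np.ones(n) if s_b is None else s_b
    s_g = np.ones(n) if s_g is None else s_g
    S_B, S_G = np.diag(s_b), np.diag(s_g)
    S_A = np.diag(1.0 / (s_b * s_g))   # S_A = (S_B S_G)^{-1}
    V = S_B @ BT @ d @ BT.T @ S_B      # scaled input transform
    U = S_G @ G @ g @ G.T @ S_G        # scaled weight transform
    M = U * V                          # Hadamard product in the Winograd domain
    return AT @ S_A @ M @ S_A @ AT.T   # scaled output transform


rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))        # input tile
g = rng.standard_normal((3, 3))        # filter
s_b = rng.uniform(0.5, 2.0, size=4)    # stand-in for learned S_B diagonal
s_g = rng.uniform(0.5, 2.0, size=4)    # stand-in for learned S_G diagonal

y_ref = direct_corr(d, g)
y_plain = winograd_tile(d, g)
y_scaled = winograd_tile(d, g, s_b, s_g)
print(np.allclose(y_plain, y_ref), np.allclose(y_scaled, y_ref))
```

Under this assumption, entry (i, j) of the Hadamard product is multiplied by the product of the i-th and j-th diagonal entries of S_B S_G, which S_A cancels exactly. The full-precision output is therefore unchanged, while the per-tap ranges seen by the quantizer can be reshaped by the choice of scales.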

Abstract

Despite the revolutionary breakthroughs of large-scale text-to-image diffusion models for complex vision and downstream tasks, their extremely high computational and storage costs limit their usability. Quantization of diffusion models has been explored in recent works to reduce compute costs and memory bandwidth usage. To further improve inference time, fast convolution algorithms such as Winograd can be used for convolution layers, which account for a significant portion of computations in diffusion models. However, fully quantized Winograd suffers significant quality loss under existing coarser-grained post-training quantization methods, and finetuning the Winograd transformation matrices to recover quality is complex and costly for such large models; together, these issues make existing approaches unsuitable for large-scale foundation models. Motivated by the large range of values present in diffusion models, we investigate the impact of finer-grained group-wise quantization in quantizing them. While group-wise quantization can largely handle the fully quantized Winograd convolution, it struggles to deal with the large distribution imbalance in a sizable portion of the Winograd domain computation. To reduce range differences in the Winograd domain, we propose finetuning only the scale parameters of the Winograd transform matrices without using any domain-specific training data. Because our method does not depend on any training data, the generalization performance of quantized diffusion models is safely guaranteed. For the text-to-image generation task, the 8-bit fully quantized diffusion model with Winograd provides near-lossless quality (FID and CLIP scores) in comparison to the full-precision model. For image classification, our method outperforms the state-of-the-art Winograd PTQ method by 1.62% and 2.56% in top-1 ImageNet accuracy on ResNet-18 and ResNet-34, respectively, with Winograd F(6, 3).
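
To make the contrast between coarser-grained and group-wise quantization concrete, the following is a minimal sketch assuming symmetric quantization with one scale per contiguous group of a flattened tensor. The grouping axis, the group size of 32, and the function names are illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def quantize_symmetric(x, n_bits=8):
    """Symmetric quantization of x with a single scale; returns (ints, scale)."""
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = np.max(np.abs(x))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale


def fake_quant_per_tensor(x, n_bits=8):
    """Quantize then dequantize with one scale for the whole tensor."""
    q, scale = quantize_symmetric(x, n_bits)
    return q * scale


def fake_quant_group_wise(x, group_size, n_bits=8):
    """Quantize then dequantize with one scale per contiguous group."""
    assert x.size % group_size == 0, "tensor size must be divisible by group_size"
    groups = x.reshape(-1, group_size)
    out = np.empty(groups.shape, dtype=np.float64)
    for i, group in enumerate(groups):
        q, scale = quantize_symmetric(group, n_bits)
        out[i] = q * scale
    return out.reshape(x.shape)


rng = np.random.default_rng(0)
x = rng.standard_normal(256)
x[:32] *= 100.0  # one group with a much larger dynamic range than the rest

err_tensor = np.mean((x - fake_quant_per_tensor(x)) ** 2)
err_group = np.mean((x - fake_quant_group_wise(x, group_size=32)) ** 2)
print(f"per-tensor MSE: {err_tensor:.5f}, group-wise MSE: {err_group:.5f}")
```

With a single scale, the large-range values set the step size for the whole tensor and the small values lose most of their resolution. Per-group scales confine that cost to the group that actually contains the large values.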

Paper Structure

This paper contains 23 sections, 32 equations, 12 figures, 12 tables, 1 algorithm.

Figures (12)

  • Figure 1: Group-wise quantization for convolution layers.
  • Figure 2: Group-wise fully quantized Winograd convolution. Applying group-wise quantization to Hadamard product and input transformation has a minimal impact on model quality. However, doing the same with the output transformation leads to a significant drop in model accuracy. Weight transformation can be done offline with high precision.
  • Figure 3: Dynamic ranges across different taps or pixels of the Winograd domain output (Y) are very different. (a) Relative standard deviations at all of the $8 \times 8$ locations of the Winograd domain output tile, obtained from the InstaFlow-0.9B model. (b) Histograms of values at locations (1, 4) and (1, 5). (c) Relative standard deviations after learning Winograd scales.
  • Figure 4: (a) Image generated from the FP16 model with AKL. (b), (c) Images generated from W8A8 group-wise quantized standard Winograd convolution, using AKL and TAESD, respectively. (d) Image generated from W8A8 group-wise quantized Winograd convolution with learned scales and AKL. The InstaFlow-0.9B model and Winograd F(6, 3) were used.
  • Figure 5: (a) Image generated from the FP16 model with AKL. (b), (c) Images generated from W8A8 group-wise quantized standard Winograd convolution, using AKL and TAESD, respectively. (d) Image generated from W8A8 group-wise quantized Winograd convolution with learned scales and AKL. The Stable Diffusion V1.5 model and Winograd F(6, 3) were used.
  • ...and 7 more figures