CDM-QTA: Quantized Training Acceleration for Efficient LoRA Fine-Tuning of Diffusion Model

Jinming Lu, Minghao She, Wendong Mao, Zhongfeng Wang

TL;DR

The paper addresses the high resource cost of fine-tuning diffusion models for personalized concepts. It combines a LoRA-based fine-tuning scheme that updates only the cross-attention projections via $Y = XW + XAB^{T}$ with a fully quantized training pipeline that uses INT8 precision and the scale $S = X_{\text{max}}/(2^{q-1}-1)$. A flexible hardware accelerator with a $64\times64$ systolic PE array supports weight-stationary (WS) and output-stationary (OS) dataflows to maximize utilization during irregular LoRA cross-attention computations. Results show up to $1.81\times$ training speedup and $5.50\times$ energy efficiency improvement over baselines, enabling practical edge deployment of customized diffusion models.
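The two formulas in the summary can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the per-tensor granularity, the round-to-nearest rule, and the reading of $X_{\text{max}}$ as the largest absolute value in the tensor are all assumptions made for illustration. The shapes assumed are $X$: $n \times d_{in}$, $W$: $d_{in} \times d_{out}$, $A$: $d_{in} \times r$, $B$: $d_{out} \times r$.

```python
# Illustrative sketch only; names and quantization details are assumptions.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def lora_forward(x, w, a, b):
    """Compute Y = X W + X A B^T, where W is frozen and A, B are low-rank factors."""
    base = matmul(x, w)
    low_rank = matmul(matmul(x, a), transpose(b))
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(base, low_rank)]

def quantize(x, q=8):
    """Symmetric per-tensor quantization with S = X_max / (2^(q-1) - 1).

    X_max is taken here as the largest absolute value in x (an assumption).
    """
    x_max = max(abs(v) for row in x for v in row)
    levels = 2 ** (q - 1) - 1  # 127 when q = 8
    scale = x_max / levels if x_max > 0 else 1.0
    codes = [[max(-levels, min(levels, round(v / scale))) for v in row] for row in x]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [[c * scale for c in row] for row in codes]

# Tiny example: one input row, a 2x2 frozen weight, and rank r = 1.
x = [[1.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0], [1.0]]
b = [[0.5], [-0.5]]
y = lora_forward(x, w, a, b)   # [[2.5, 0.5]]

codes, scale = quantize(y)     # integer codes in [-127, 127] plus one scale
y_hat = dequantize(codes, scale)
```

With round-to-nearest, each dequantized value differs from the original by at most half a quantization step ($S/2$), which is why a single scale per tensor can be adequate when the value range is well behaved.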

Abstract

Fine-tuning large diffusion models for custom applications demands substantial power and time, which poses significant challenges for efficient implementation on mobile devices. In this paper, we develop a novel training accelerator specifically for Low-Rank Adaptation (LoRA) of diffusion models, aiming to streamline the process and reduce computational complexity. By leveraging a fully quantized training scheme for LoRA fine-tuning, we achieve substantial reductions in memory usage and power consumption while maintaining high model fidelity. The proposed accelerator features flexible dataflow, enabling high utilization for irregular and variable tensor shapes during the LoRA process. Experimental results show up to 1.81x training speedup and 5.50x energy efficiency improvements compared to the baseline, with minimal impact on image generation quality.

Paper Structure

This paper contains 14 sections, 2 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Cross-Attention module in the custom diffusion model.
  • Figure 2: LoRA fine-tuning for the Custom Diffusion model. Only the weights shown in pink are trainable, which account for a tiny fraction of the entire model.
  • Figure 3: Mixed precision quantization scheme based on LoRA
  • Figure 4: Overview of hardware architecture and dataflow. (a) The hardware architecture of the proposed accelerator. (b) and (c) are the WS and OS dataflows for various computation processes.
  • Figure 5: Comparison of the generation results of Custom Diffusion and the quantized model proposed in this paper.
  • ...and 1 more figures