IntLoRA: Integral Low-rank Adaptation of Quantized Diffusion Models

Hang Guo, Yawei Li, Tao Dai, Shu-Tao Xia, Luca Benini

TL;DR

IntLoRA addresses the cost of adapting and deploying quantized diffusion models by using integer low-rank updates that merge with the pre-trained weights without post-training quantization (PTQ). It introduces Adaptation-Quantization Separation (AQS), Multiplicative Low-rank Adaptation (MLA), and Variance Matching Control (VMC) to support end-to-end integer arithmetic, with two deployment variants, IntLoRA_MUL and IntLoRA_SHIFT. The approach derives a PTQ-free weight-merging framework (e.g., $W' = \mathcal{Q}(W - R) + R + AB$ and its MLA form) and demonstrates strong performance and efficiency across multiple diffusion personalization tasks on consumer hardware. Empirical results show significant training and inference speedups while maintaining or improving accuracy, which makes personalized diffusion models more practical to deploy.
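To make the merging formula $W' = \mathcal{Q}(W - R) + R + AB$ concrete, here is a minimal NumPy sketch. It is an illustration under assumptions, not the authors' implementation: the min-max uniform quantizer, the helper names `uniform_quantize` and `dequantize`, the Gaussian choice for the auxiliary matrix $R$, and all shapes and scales are hypothetical. It shows two things: a plain additive merge of a quantized weight with a floating-point update falls off the quantization grid (hence the need for PTQ), and with a zero-initialized $B$ the AQS form reproduces $W$ up to half a quantization step while the adaptation term $R + AB$ is already non-zero.

```python
import numpy as np

def uniform_quantize(w, bits=4):
    """Generic min-max uniform quantizer (assumed here, not taken from the paper).

    Returns integer codes in [0, 2**bits - 1], the scale, and the zero point.
    """
    qmax = 2**bits - 1
    scale = (w.max() - w.min()) / qmax
    zero = np.round(-w.min() / scale)
    codes = np.clip(np.round(w / scale) + zero, 0, qmax)
    return codes.astype(np.int32), scale, zero

def dequantize(codes, scale, zero):
    """Map integer codes back to real values: scale * (codes - zero)."""
    return scale * (codes - zero)

rng = np.random.default_rng(0)
dim, rank = 64, 4

W = rng.normal(0.0, 0.02, size=(dim, dim))   # stand-in for a pre-trained weight
A = rng.normal(0.0, 0.01, size=(dim, rank))  # low-rank factor A
B = np.zeros((rank, dim))                    # zero-initialized, so AB = 0 at the start

# Motivation: additive merge of a quantized weight with a float update.
codes_w, s_w, z_w = uniform_quantize(W)
B_trained = rng.normal(0.0, 0.01, size=(rank, dim))  # pretend this came from training
merged_fp = dequantize(codes_w, s_w, z_w) + A @ B_trained
# Distance of the merged weights from the nearest grid point, in units of the scale.
off_grid = np.abs(merged_fp / s_w - np.round(merged_fp / s_w)).max()

# AQS form: quantize W - R, then add R back together with AB.
R = rng.normal(0.0, 0.01, size=W.shape)      # auxiliary matrix; distribution is an assumption
codes, scale, zero = uniform_quantize(W - R)
W_merged = dequantize(codes, scale, zero) + R + A @ B
init_error = np.abs(W_merged - W).max()      # only the rounding error of Q remains

print(f"off-grid residual of additive merge: {off_grid:.4f} (in units of scale)")
print(f"AQS init error: {init_error:.6f}, half step: {scale / 2:.6f}")
```

The point of the design is visible in the last lines: $AB$ alone is all zeros at initialization, which is a poor target for quantization, whereas $R + AB$ has a spread-out distribution from the first step, and the quantized term $\mathcal{Q}(W - R)$ never has to be re-quantized.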

Abstract

Fine-tuning pre-trained diffusion models under limited budgets has achieved great success. In particular, recent advances that directly fine-tune quantized weights using Low-rank Adaptation (LoRA) further reduce training costs. Despite this progress, we point out that existing adaptation recipes are not inference-efficient. Specifically, additional post-training quantization (PTQ) on the tuned weights is needed during deployment, which results in a noticeable performance drop when the bit-width is low. Based on this observation, we introduce IntLoRA, which adapts quantized diffusion models with integer-type low-rank parameters, to account for inference efficiency during tuning. Specifically, IntLoRA enables pre-trained weights to remain quantized during training, facilitating fine-tuning on consumer-level GPUs. During inference, IntLoRA weights can be seamlessly merged into the pre-trained weights to directly obtain quantized downstream weights without PTQ. Extensive experiments show that IntLoRA achieves significant speedup on both training and inference without losing performance.

Paper Structure

This paper contains 22 sections, 9 equations, 19 figures, 6 tables, 3 algorithms.

Figures (19)

  • Figure 1: (a) The arithmetic inconsistency between the pre-trained and adaptation weights leaves the merged weights in FP16. Consequently, additional PTQ is needed for low-bit inference. (b) IntLoRA works directly in INT4 arithmetic, so the merged weights are already in INT4 format, which streamlines the whole process.
  • Figure 2: The utilization of PTQ on the downstream merged weights leads to severe performance degradation under low bit-width quantization.
  • Figure 3: Before tuning, we propose the Adaptation-Quantization Separation (AQS) to incorporate an auxiliary matrix into the pre-trained weights and low-rank weights, yielding a zero-initialized but quantization-friendly distribution. Then, the Multiplicative Low-rank Adaptation (MLA) is used to reformulate additive LoRA into the product of the "pre-training term" and the "adaptation term". Finally, we introduce the Variance Matching Control (VMC) to adjust the distribution of the adaptation term by modulating the auxiliary matrix. After tuning, we use hardware-friendly integer multiplication or bit shifting to directly generate quantized merged weights without additional PTQ. The detailed algorithm is given in the supplementary material.
  • Figure 4: Qualitative comparison on subject-driven generation tasks. More results are provided in the supplementary material.
  • Figure 5: Qualitative comparison on controllable generation tasks. More results are provided in the supplementary material.
  • ...and 14 more figures
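
The reformulation described in the Figure 3 caption, from additive LoRA to a product of a "pre-training term" and an "adaptation term", can be sketched algebraically. This is a reading under assumptions rather than the paper's exact notation: $\odot$ and $\oslash$ denote element-wise product and division, $\mathbf{1}$ is the all-ones matrix, and $s$, $z$, and $W_{\text{int}}$ (scale, zero point, and integer codes of the quantizer) are symbols introduced here for illustration. The identity holds wherever the entries of $\mathcal{Q}(W - R)$ are non-zero.

$$
\begin{aligned}
W' &= \mathcal{Q}(W - R) + R + AB \\
   &= \mathcal{Q}(W - R) \odot \left[ \mathbf{1} + (R + AB) \oslash \mathcal{Q}(W - R) \right] \\
   &= \underbrace{s \cdot (W_{\text{int}} - z)}_{\text{pre-training term}} \odot \underbrace{\left[ \mathbf{1} + (R + AB) \oslash \mathcal{Q}(W - R) \right]}_{\text{adaptation term}}
\end{aligned}
$$

Read this way, the pre-training term stays in its quantized integer form, and only the adaptation term needs to be brought into integer form. If that term is rounded to integers, the merge is an integer multiplication; if it is rounded to powers of two, the merge is a bit shift. This is presumably the distinction between the IntLoRA_MUL and IntLoRA_SHIFT variants named in the TL;DR.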