Efficient Personalization of Quantized Diffusion Model without Backpropagation

Hoigi Seo, Wongi Jeong, Kyungryeol Lee, Se Young Chun

TL;DR

This work tackles the memory bottleneck of personalizing diffusion models by introducing ZOODiP, a framework that personalizes quantized diffusion models using zeroth-order optimization, which needs only forward passes. It combines three components: learning the target concept through a quantized model, Subspace Gradient to suppress noisy gradient directions, and Partial Uniform Timestep Sampling to focus updates on the timesteps where text embeddings matter most. The approach reduces training memory by up to $8.2\times$ while keeping image and text alignment comparable to prior methods, which makes personalization feasible on edge devices. Together, quantization, gradient-free optimization, and subspace-aware updates offer a practical path toward privacy-preserving personalized diffusion generation on resource-constrained hardware.

Abstract

Diffusion models have shown remarkable performance in image synthesis, but they demand extensive computational and memory resources for training, fine-tuning, and inference. Although advanced quantization techniques have successfully minimized memory usage for inference, training and fine-tuning these quantized models still requires substantial memory, possibly because of dequantization for accurate gradient computation and/or backpropagation for gradient-based algorithms. However, memory-efficient fine-tuning is particularly desirable for applications such as personalization, which often must run on edge devices like mobile phones with private data. In this work, we address this challenge by personalizing a quantized diffusion model via Textual Inversion and by applying zeroth-order optimization to the personalization tokens without dequantization, so that no gradients or activations need to be stored for backpropagation, which consumes considerable memory. Since gradient estimates from zeroth-order optimization are quite noisy when only one or a few images are available for personalization, we propose to denoise the estimated gradient by projecting it onto a subspace constructed from the past history of the tokens, dubbed Subspace Gradient. In addition, we investigated the influence of the text embedding on image generation, which led to our proposed timestep sampling scheme, dubbed Partial Uniform Timestep Sampling, which samples from the effective diffusion timesteps. Our method achieves performance comparable to prior methods in image and text alignment scores for personalizing Stable Diffusion with only forward passes, while reducing training memory demand by up to $8.2\times$.
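To make the forward-only update concrete, here is a minimal sketch of a two-point random gradient estimate combined with timestep sampling from a restricted uniform range. It is an illustration under stated assumptions, not the authors' implementation: the function names `sample_timestep` and `zo_gradient`, the Gaussian perturbation, the smoothing parameter `mu`, and the toy quadratic loss are all hypothetical, and the default range $[500, 1000)$ is borrowed from the range reported as effective in the Figure 4 caption rather than from the method's actual settings.

```python
import numpy as np

def sample_timestep(rng, t_low=500, t_high=1000):
    """Draw one integer timestep uniformly from [t_low, t_high).

    The default range is an assumption taken from the Figure 4 caption.
    """
    return int(rng.integers(t_low, t_high))

def zo_gradient(loss_fn, token, rng, mu=1e-3, n_queries=1):
    """Estimate d(loss)/d(token) from loss evaluations only (no backprop).

    Each query perturbs the token along a random Gaussian direction u and
    uses the finite difference between the perturbed and original loss.
    """
    base = loss_fn(token)  # loss at the unperturbed token
    grad = np.zeros_like(token)
    for _ in range(n_queries):
        u = rng.standard_normal(token.shape)
        grad += (loss_fn(token + mu * u) - base) / mu * u
    return grad / n_queries

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = np.linspace(-1.0, 1.0, 8)  # toy stand-in for the ideal token
    token = np.zeros(8)
    toy_loss = lambda x: float(np.sum((x - target) ** 2))
    for _ in range(300):
        t = sample_timestep(rng)  # unused by the toy loss; a real loss would depend on t
        token -= 0.02 * zo_gradient(toy_loss, token, rng, n_queries=4)
    print(toy_loss(token))
```

In a real setting `loss_fn` would wrap a forward pass of the quantized model at the sampled timestep, with the same noise sample used for the original and the perturbed token so that the finite difference reflects only the change in the token.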

Paper Structure

This paper contains 42 sections, 12 equations, 18 figures, 15 tables, 2 algorithms.

Figures (18)

  • Figure 1: Analysis of memory consumption and performance of Stable Diffusion personalization methods. (Left) GPU memory breakdown for each method when personalizing Stable Diffusion with a batch size of 1. ZOODiP (Ours) is significantly more memory-efficient than the other methods. (Right) Memory usage versus performance across methods. Performance is measured with text (CLIP-T) and image (CLIP-I) alignment scores. ZOODiP achieves performance comparable to other methods while using significantly less memory (up to $8.2\times$ less than DreamBooth). Memory usage was profiled using the PyTorch profiler and the nvidia-smi command.
  • Figure 2: (a) Illustration of the overall ZOODiP framework. A target token is initialized and added to the prompt. Reference images are encoded, and noise is predicted at a timestep drawn with Partial Uniform Timestep Sampling (PUTS). The loss is computed with both the original and the perturbed token to estimate the gradient. (b) Illustration of Subspace Gradient (SG). Updated tokens from the previous $\tau$ iterations are stored. PCA identifies low-variance eigenvectors, which are used to project noisy dimensions out of the estimated gradient for the next $\tau$ iterations.
  • Figure 3: Sparse effective dimension in the token trained with Textual Inversion. Notably, the concept was preserved even when retaining only one-third of the optimized dimensions ($k=256$).
  • Figure 4: Textual Inversion (Gal et al., 2022) with various timestep sampling ranges. When the timestep $t$ for training is sampled from $U(0,500)$, key conceptual features such as color and body shape of the reference image are not effectively trained. In contrast, sampling from $U(500,1000)$ results in successful learning of these features.
  • Figure 5: Qualitative comparison of image and text alignment. This figure shows how well each method generates images that match the input text prompt while preserving the identity of the personalized subject. ZOODiP generates images that faithfully reflect the prompt while maintaining the concept of the reference image, demonstrating strong image-text alignment.
  • ...and 13 more figures
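
The Subspace Gradient step described in the Figure 2(b) caption can be sketched as follows. This is a hedged illustration rather than the paper's procedure: the helper names `low_variance_basis` and `project_out` are hypothetical, and the rule for deciding which eigenvectors count as "low-variance" (a fixed fraction `var_ratio` of the largest eigenvalue) is an assumption, since the caption does not state the selection criterion.

```python
import numpy as np

def low_variance_basis(history, var_ratio=0.01):
    """Return eigenvectors (as columns) of the token-history covariance
    whose variance is below var_ratio times the largest eigenvalue.

    history: array of shape (tau, d), one stored token per row.
    The thresholding rule is an assumption made for this sketch.
    """
    centered = history - history.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(len(history) - 1, 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    mask = eigvals < var_ratio * eigvals[-1]
    return eigvecs[:, mask]

def project_out(grad, basis):
    """Remove from grad its components along the orthonormal columns of basis."""
    return grad - basis @ (basis.T @ grad)

if __name__ == "__main__":
    tau, d = 20, 6
    history = np.zeros((tau, d))
    history[:, 0] = np.linspace(-1.0, 1.0, tau)      # token moved along dim 0
    history[:, 1] = 0.5 * np.tile([1.0, -1.0], tau // 2)  # and along dim 1
    basis = low_variance_basis(history)
    noisy_grad = np.ones(d)  # stand-in for a zeroth-order gradient estimate
    print(project_out(noisy_grad, basis))
```

Following the caption, the basis would be recomputed from the stored tokens every $\tau$ iterations and then reused to filter each estimated gradient during the next $\tau$ iterations.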