
Spectrum-Aware Parameter Efficient Fine-Tuning for Diffusion Models

Xinxi Zhang, Song Wen, Ligong Han, Felix Juefei-Xu, Akash Srivastava, Junzhou Huang, Hao Wang, Molei Tao, Dimitris N. Metaxas

TL;DR

The paper tackles efficient adaptation of large pre-trained diffusion models by proposing Spectral Orthogonal Decomposition Adaptation (SODA), a spectrum-aware fine-tuning method that jointly tunes the singular values and the basis vectors of pretrained weight matrices. SODA decomposes each pretrained weight as $\mathbf{W}_0 = \mathbf{W}_0^{spec} \mathbf{W}_0^{basis}$, learns a spectrum update $\Delta\boldsymbol{S}$, and rotates the orthogonal basis with a Kronecker-structured matrix $\mathbf{R}$ optimized on the Stiefel manifold. It offers two decomposition variants, SVD-based and QR/LQ-based, that keep fine-tuning parameter-efficient yet expressive. The method is demonstrated on text-to-image diffusion personalization tasks (subject and style) with extensive ablations. The results show that SODA surpasses strong baselines such as LoRA and OFT in both fidelity and style-preserving compositional generation, highlighting the value of exploiting spectral priors for high-capacity yet efficient fine-tuning in diffusion models.
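The SVD-based variant described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `cayley` and `soda_svd_sketch`, the choice to apply the rotation on the right of the basis, and the use of a Cayley transform to build the orthogonal Kronecker factors (in place of the Stiefel optimizer the paper uses) are all assumptions made for illustration.

```python
import numpy as np

def cayley(A):
    """Build an orthogonal matrix from the skew-symmetric part of A (Cayley transform)."""
    K = A - A.T                      # skew-symmetric, so I + K is always invertible
    I = np.eye(A.shape[0])
    return np.linalg.solve(I + K, I - K)   # (I + K)^{-1} (I - K)

def soda_svd_sketch(W0, delta_s, A1, A2):
    """Hypothetical SVD-style adaptation: shift the spectrum, rotate the basis."""
    # Frozen decomposition of the pretrained weight: W0 = (U diag(s)) @ Vt
    U, s, Vt = np.linalg.svd(W0, full_matrices=False)
    # The Kronecker product of two small orthogonal matrices is a large orthogonal matrix
    R = np.kron(cayley(A1), cayley(A2))
    # Trainable pieces are delta_s, A1, A2; U, s, Vt stay fixed
    W = (U * (s + delta_s)) @ Vt @ R
    return W, R

rng = np.random.default_rng(0)
W0 = rng.standard_normal((6, 4))           # toy "pretrained" weight
delta_s = 0.01 * rng.standard_normal(4)    # spectrum update
A1 = 0.1 * rng.standard_normal((2, 2))     # parameters of the first Kronecker factor
A2 = 0.1 * rng.standard_normal((2, 2))     # parameters of the second Kronecker factor

W, R = soda_svd_sketch(W0, delta_s, A1, A2)
# With all trainable parameters at zero, the pretrained weight is recovered
W_init, _ = soda_svd_sketch(W0, np.zeros(4), np.zeros((2, 2)), np.zeros((2, 2)))
```

In this sketch the rotation is stored as two factors of sizes $p \times p$ and $q \times q$, so it needs $p^2 + q^2$ entries instead of the $(pq)^2$ entries of an unstructured rotation. The adapted weight's singular values are exactly the shifted spectrum, because the rotated basis stays orthogonal.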

Abstract

Adapting large-scale pre-trained generative models in a parameter-efficient manner is gaining traction. Traditional methods like low rank adaptation achieve parameter efficiency by imposing constraints but may not be optimal for tasks requiring high representation capacity. We propose a novel spectrum-aware adaptation framework for generative models. Our method adjusts both singular values and their basis vectors of pretrained weights. Using the Kronecker product and efficient Stiefel optimizers, we achieve parameter-efficient adaptation of orthogonal matrices. We introduce Spectral Orthogonal Decomposition Adaptation (SODA), which balances computational efficiency and representation capacity. Extensive evaluations on text-to-image diffusion models demonstrate SODA's effectiveness, offering a spectrum-aware alternative to existing fine-tuning methods.
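To make the idea of optimizing on the Stiefel manifold concrete, here is a minimal sketch of one Riemannian gradient step with a QR retraction. This is a generic textbook scheme, not the specific Stiefel optimizer used in the paper; the function name `stiefel_step` and the toy data are hypothetical.

```python
import numpy as np

def stiefel_step(X, G, lr):
    """One gradient step that keeps the columns of X orthonormal.

    X:  (n, p) matrix with orthonormal columns (a point on the Stiefel manifold)
    G:  (n, p) Euclidean gradient of the loss at X
    lr: step size
    """
    # Project the Euclidean gradient onto the tangent space at X
    XtG = X.T @ G
    rgrad = G - X @ (XtG + XtG.T) / 2
    # Take the step, then retract back onto the manifold with a QR decomposition
    Q, Rm = np.linalg.qr(X - lr * rgrad)
    # Fix column signs so the diagonal of the triangular factor is non-negative
    signs = np.sign(np.diag(Rm))
    signs[signs == 0] = 1
    return Q * signs

rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((8, 3)))   # random orthonormal starting point
G = rng.standard_normal((8, 3))                    # stand-in for a loss gradient
X_new = stiefel_step(X, G, lr=0.1)
```

The retraction is what distinguishes this from an ordinary gradient step: after the update, `X_new` still has orthonormal columns, so an orthogonal basis stays orthogonal throughout training.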

Paper Structure (17 sections, 12 equations, 16 figures, 1 table)

Figures (16)

  • Figure 1: SODA achieves superior image quality and text alignment across diverse input images and prompts, such as changing the background, altering the texture, and synthesizing new poses. Additionally, SODA can generate prompt-aligned images in a given style specified by an input style image.
  • Figure 2: Comparison of different PEFT approaches. (■: frozen parameters; ■: tunable parameters; ■: zeros.)
  • Figure 3: Results for Subject Personalization. Each subfigure consists of 3 samples: a large one on the left and two smaller ones on the right. The text under the input images indicates the class of the personalized subject, while the text prompt under the sample images is used for inference. Our observations indicate that SODA outperforms both LoRA and OFT in generating prompt-aligned images while preserving subject identities at a similar level.
  • Figure 4: Personalized style generation. We show curated samples of ours (SODA-SVD, SODA-QR), SVDiff (Han et al., 2023) (SVD), and LoRA (Hu et al., 2021). Independently trained subject and style weights are merged without joint training. SVDiff tends to overfit to the subject and fail to preserve the style well.
  • Figure 5: Compositional generation of my subject in my style. We show visual samples of my subject in my style with different actions or visual attributes specified by the text prompts. Independently trained subject and style weights are merged without joint training.
  • ...and 11 more figures

Theorems & Definitions (3)

  • Remark
  • Remark
  • proof