Controlling Text-to-Image Diffusion by Orthogonal Finetuning

Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, Bernhard Schölkopf

TL;DR

This work introduces Orthogonal Finetuning (OFT) to adapt large text-to-image diffusion models for downstream tasks while provably preserving hyperspherical energy, i.e., the layer-wise pairwise angular relationships among neurons. By learning layer-shared orthogonal transformations $\boldsymbol{R}$ (with options for block-diagonal structure and Cayley parameterization), OFT keeps the pretrained semantic structure intact, enabling effective subject-driven and controllable generation with high data and training efficiency. A constrained variant, COFT, adds an explicit deviation budget from the pretrained weights to further stabilize finetuning, while extensions like Re-scaled OFT add magnitude scaling without affecting the hyperspherical energy. Empirically, OFT and COFT outperform DreamBooth, LoRA, and ControlNet in generation quality, convergence speed, and controllability, all without additional inference overhead, highlighting the practical impact of angular information preservation in diffusion-model finetuning. Open problems include speeding up the Cayley-based orthogonalization for very large models, exploring compositionality of multiple orthogonal transforms, and further improving parameter efficiency with bias-minimizing structures.
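The Cayley parameterization and block-diagonal structure mentioned above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `cayley` and `block_diag_orthogonal`, the block sizes, and the random initialization are all assumptions made for the example. It relies on the standard fact that $(I + S)(I - S)^{-1}$ is orthogonal whenever $S$ is skew-symmetric.

```python
import numpy as np

def cayley(q):
    """Map an arbitrary square matrix to an orthogonal one (Cayley transform)."""
    s = 0.5 * (q - q.T)                 # skew-symmetric part, so s.T == -s
    eye = np.eye(q.shape[0])
    # I - S is always invertible for skew-symmetric S (its eigenvalues are 1 - i*lambda).
    return (eye + s) @ np.linalg.inv(eye - s)

def block_diag_orthogonal(params):
    """Build a block-diagonal orthogonal matrix from a list of square parameter blocks.

    Hypothetical helper: each block is orthogonalized independently, so the
    number of trainable parameters shrinks as the number of blocks grows.
    """
    d = sum(p.shape[0] for p in params)
    out = np.zeros((d, d))
    start = 0
    for p in params:
        k = p.shape[0]
        out[start:start + k, start:start + k] = cayley(p)
        start += k
    return out

rng = np.random.default_rng(0)
# Two 4x4 blocks give an 8x8 orthogonal matrix (sizes chosen only for illustration).
R = block_diag_orthogonal([rng.normal(size=(4, 4)) for _ in range(2)])
is_orthogonal = np.allclose(R.T @ R, np.eye(8))
# Zero parameters give the identity, i.e. the pretrained weights are left untouched.
identity_at_zero = np.allclose(cayley(np.zeros((3, 3))), np.eye(3))
```

One consequence of this parameterization is that initializing the trainable parameters at zero makes the transformation the identity, so a finetuning run under this sketch would start exactly at the pretrained model.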

Abstract

Large text-to-image diffusion models have impressive capabilities in generating photorealistic images from text prompts. How to effectively guide or control these powerful models to perform different downstream tasks becomes an important open problem. To tackle this challenge, we introduce a principled finetuning method -- Orthogonal Finetuning (OFT), for adapting text-to-image diffusion models to downstream tasks. Unlike existing methods, OFT can provably preserve hyperspherical energy which characterizes the pairwise neuron relationship on the unit hypersphere. We find that this property is crucial for preserving the semantic generation ability of text-to-image diffusion models. To improve finetuning stability, we further propose Constrained Orthogonal Finetuning (COFT) which imposes an additional radius constraint to the hypersphere. Specifically, we consider two important finetuning text-to-image tasks: subject-driven generation where the goal is to generate subject-specific images given a few images of a subject and a text prompt, and controllable generation where the goal is to enable the model to take in additional control signals. We empirically show that our OFT framework outperforms existing methods in generation quality and convergence speed.
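To make the preservation claim concrete, the sketch below compares hyperspherical energy before and after two kinds of weight update. It rests on several assumptions not stated in this summary: neurons are taken to be the columns of a weight matrix `W`, the energy is taken to be the sum of inverse Euclidean distances between unit-normalized neurons (one common form), and the orthogonal matrix is drawn from a QR decomposition rather than learned. All names and sizes are hypothetical.

```python
import numpy as np

def hyperspherical_energy(w):
    """Sum of inverse pairwise distances between unit-normalized columns of w.

    Assumed definition for illustration; columns of w play the role of neurons.
    """
    unit = w / np.linalg.norm(w, axis=0, keepdims=True)   # project onto unit sphere
    diff = unit[:, :, None] - unit[:, None, :]            # shape (d, n, n)
    dist = np.linalg.norm(diff, axis=0)                   # pairwise distances (n, n)
    off_diag = ~np.eye(w.shape[1], dtype=bool)            # skip i == j pairs
    return float(np.sum(1.0 / dist[off_diag]))

rng = np.random.default_rng(0)
d, n = 6, 5                                # illustrative sizes only
W = rng.normal(size=(d, n))                # stand-in for a pretrained weight matrix

# A random orthogonal matrix from QR, standing in for a learned transformation.
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

he_pretrained = hyperspherical_energy(W)
he_orthogonal = hyperspherical_energy(R @ W)                       # rotate all neurons together
he_additive = hyperspherical_energy(W + 0.5 * rng.normal(size=(d, n)))  # unconstrained additive update
```

Because an orthogonal matrix preserves both norms and pairwise distances, `he_orthogonal` matches `he_pretrained` up to floating-point error, whereas a generic additive perturbation moves the neurons relative to one another and changes the energy.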

Paper Structure (38 sections, 5 equations, 34 figures, 6 tables)


Figures (34)

  • Figure 1: (a) Subject-driven generation: OFT preserves the hyperspherical energy and yields more stable finetuning performance across different numbers of iterations, while DreamBooth (Ruiz et al., 2023) and LoRA (Hu et al., 2022) do neither. (b) Controllable generation: OFT is more sample-efficient in training and converges well with only 5% of the original dataset, while ControlNet (Zhang et al., 2023) and LoRA cannot converge until 50% of the data is present. The hyperspherical energy comparison between LoRA and OFT is fair, since they finetune the same layers. ControlNet uses a different layer finetuning strategy, so its hyperspherical energy is not comparable. The detailed settings are given in the experiment section and the appendix.
  • Figure 2: A toy experiment to demonstrate the importance of angular information. The autoencoder is trained in a standard way using inner product activation, and (a) shows the standard reconstruction. At test time, the angular information of neurons alone is enough to recover the input image well, even though the autoencoder is not trained with angles.
  • Figure 3: Controllable generation with or without orthogonality. The middle column is from the original OFT, and the right column is from OFT without the orthogonality constraint.
  • Figure 4: (a) Original OFT without a diagonal structure. (b) OFT with $r$ diagonal blocks of the same size. When $r=1$, the case of (b) recovers the case of (a).
  • Figure 5: How $\epsilon$ affects the flexibility of COFT in subject-driven generation.
  • ...and 29 more figures