Orthogonal Adaptation for Modular Customization of Diffusion Models

Ryan Po, Guandao Yang, Kfir Aberman, Gordon Wetzstein

TL;DR

This work introduces Orthogonal Adaptation for modular customization of diffusion models, allowing independently fine-tuned concepts to be merged instantly without retraining or increased computation. By factorizing concept residuals as $\Delta \theta_i = A_iB_i^T$ with $B_i$ kept fixed and enforcing near-orthogonality across concepts ($B_i^TB_j\approx 0$), the method minimizes crosstalk and preserves identity during multi-concept synthesis. The authors propose practical strategies for constructing $B_i$ (randomized orthogonal basis or randomized Gaussian), and demonstrate through extensive experiments that their approach yields superior identity fidelity and efficiency compared with baselines like FedAvg, DreamBooth-LoRA, and Mix-of-Show. The results show near-instantaneous merging, high identity preservation, and scalability to multiple concepts, marking a significant step toward private, scalable modular customization of diffusion models. This has practical implications for user-centric, privacy-aware content generation across personalized concepts.
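
The merging step summarized above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the matrix sizes, the helper name `merge_residuals`, and the use of disjoint identity columns as exactly orthogonal bases are assumptions made purely for illustration. It shows that when $B_1^TB_2 = 0$, summing the two residuals leaves the response to an input in the span of $B_1$ unchanged.

```python
import numpy as np

def merge_residuals(residuals):
    """Merge per-concept weight residuals by plain summation (hypothetical helper)."""
    return np.sum(residuals, axis=0)

rng = np.random.default_rng(0)
d_out, d_in, r = 5, 6, 2          # illustrative sizes, not taken from the paper

# Fixed, mutually orthogonal bases: disjoint columns of the identity (assumption).
eye = np.eye(d_in)
B1, B2 = eye[:, 0:2], eye[:, 2:4]

# Trainable factors; random values stand in for fine-tuned weights.
A1 = rng.normal(size=(d_out, r))
A2 = rng.normal(size=(d_out, r))

delta1 = A1 @ B1.T                # residual for concept 1
delta2 = A2 @ B2.T                # residual for concept 2
merged = merge_residuals([delta1, delta2])

# An input lying in the span of B1 only activates concept 1's residual.
x = B1 @ rng.normal(size=r)
crosstalk = np.linalg.norm(merged @ x - delta1 @ x)
print("B1^T B2 is zero:", np.allclose(B1.T @ B2, 0.0))
print("crosstalk norm:", crosstalk)
```

Because merging is a single matrix sum, its cost does not depend on how long each concept was fine-tuned, which is consistent with the near-instantaneous merging claimed above.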

Abstract

Customization techniques for text-to-image models have paved the way for a wide range of previously unattainable applications, enabling the generation of specific concepts across diverse contexts and styles. While existing methods facilitate high-fidelity customization for individual concepts or a limited, pre-defined set of them, they fall short of achieving scalability, where a single model can seamlessly render countless concepts. In this paper, we address a new problem called Modular Customization, with the goal of efficiently merging customized models that were fine-tuned independently for individual concepts. This allows the merged model to jointly synthesize concepts in one image without compromising fidelity or incurring any additional computational costs. To address this problem, we introduce Orthogonal Adaptation, a method designed to encourage the customized models, which do not have access to each other during fine-tuning, to have orthogonal residual weights. This ensures that during inference time, the customized models can be summed with minimal interference. Our proposed method is both simple and versatile, applicable to nearly all optimizable weights in the model architecture. Through an extensive set of quantitative and qualitative evaluations, our method consistently outperforms relevant baselines in terms of efficiency and identity preservation, demonstrating a significant leap toward scalable customization of diffusion models.

Paper Structure (39 sections, 2 theorems, 7 equations, 12 figures, 2 tables)

Key Result

Theorem 8.1

Let $\mathbf{v}\in \mathbb{R}^d$ and $\mathbf{u}\in \mathbb{R}^d$ be two random vectors. Let $\mathbf{v}_i\sim \mathcal{N}(0, \sigma^2 I)$ and $\mathbf{u}_i\sim \mathcal{N}(0, \sigma^2 I)$ for all $i\in [1, d]$ independently, then $\mathbb{E}\left[ \mathbf{v}^T\mathbf{u}\right] = 0$.
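
The statement follows from independence and zero mean: $\mathbb{E}\left[\mathbf{v}^T\mathbf{u}\right] = \sum_{i=1}^{d} \mathbb{E}[\mathbf{v}_i]\,\mathbb{E}[\mathbf{u}_i] = 0$. A quick Monte Carlo check, sketched below, makes this concrete. The dimension, variance, trial count, and seed are arbitrary assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, n_trials = 64, 1.0, 20_000   # illustrative values (assumption)

# Each row is one independent draw of v and u with i.i.d. N(0, sigma^2) entries.
v = rng.normal(0.0, sigma, size=(n_trials, d))
u = rng.normal(0.0, sigma, size=(n_trials, d))

dots = np.einsum("nd,nd->n", v, u)     # one inner product per trial
mean_dot = dots.mean()
std_err = dots.std(ddof=1) / np.sqrt(n_trials)
print(f"mean of v^T u: {mean_dot:.4f} (standard error {std_err:.4f})")
```

Note that the theorem concerns the expectation only. An individual inner product is generally nonzero, with variance $d\sigma^4$ under these assumptions, so random Gaussian vectors are orthogonal on average rather than exactly.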

Figures (12)

  • Figure 1: Modular Customization of Diffusion Models. Given a large set of individual concepts (left), the goal of Modular Customization is to enable independent customization (fine-tuning) per concept, while efficiently merging a subset of customized models during inference, so that the corresponding concepts can be jointly synthesized without compromising fidelity. To tackle this, we propose Orthogonal Adaptation, which encourages customized weights of one concept to be orthogonal to the customized weights of others.
  • Figure 2: Gallery of multi-concept generations. Our method enables efficient merging of individually fine-tuned concepts for modular, efficient multi-concept customization of text-to-image diffusion models. Each concept shown above was fine-tuned individually using orthogonal adaptation. Fine-tuned weight residuals are then merged via summation, enabling multi-concept generation.
  • Figure 3: The three stages of Modular Customization: (a) Independent Customization, (b) Modular Combination, and (c) Joint Synthesis. Note that during individual fine-tuning, all processes are private, meaning each user does not have access to ground truth data for other concepts.
  • Figure 4: Overview of Orthogonal Adaptation. (a) LoRA (Hu et al., 2022) enables training of both low-rank decomposed matrices. (b) Orthogonal adaptation constrains training only to $A$, leaving $B$ fixed. (c) For two separate concepts, $i$ and $j$, an orthogonality constraint is imposed between $B_i$ and $B_j$. (d) When concepts $i$ and $j$ are trained independently, approximate orthogonality between $B_i$ and $B_j$ can be achieved by sampling random columns from a shared orthogonal matrix. (e) Without the orthogonality constraint, correlated concepts suffer from "crosstalk" when merged; with the orthogonality constraint, orthogonal concepts preserve their identities after merging.
  • Figure 5: Over-parameterization of text-to-image models. Despite the added constraint on the trained weight residuals, due to the over-parameterized nature of large text-to-image diffusion models, our method is able to achieve single-concept customization results with comparable fidelity to the unconstrained setting.
  • ...and 7 more figures
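
The column-sampling idea in the Figure 4(d) caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the helpers `shared_orthogonal_matrix` and `sample_basis`, the QR-based construction, the seeds, and the sizes are all hypothetical.

```python
import numpy as np

def shared_orthogonal_matrix(dim, seed):
    """Build a dim x dim orthogonal matrix from a seed known to all users (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q

def sample_basis(q, rank, rng):
    """Pick `rank` distinct columns of q at random to serve as a fixed B."""
    idx = rng.choice(q.shape[1], size=rank, replace=False)
    return q[:, idx], set(idx.tolist())

dim, rank = 256, 4                     # illustrative sizes (assumption)
Q = shared_orthogonal_matrix(dim, seed=1234)

# Two users sample independently, without seeing each other's choice.
B_i, idx_i = sample_basis(Q, rank, np.random.default_rng(1))
B_j, idx_j = sample_basis(Q, rank, np.random.default_rng(2))

overlap = idx_i & idx_j                # columns picked by both users
cross = B_i.T @ B_j
print("shared columns:", len(overlap))
print("max |B_i^T B_j|:", np.abs(cross).max())
```

In this sketch, $B_i^TB_j$ is zero up to floating-point error whenever the two index sets are disjoint, and each shared column contributes one entry of magnitude 1. That is why the orthogonality is only approximate when users sample without coordination; collisions become less likely as the rank shrinks relative to the matrix dimension.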

Theorems & Definitions (4)

  • Theorem 8.1
  • proof
  • Corollary 8.1.1
  • proof