
LoRA-Composer: Leveraging Low-Rank Adaptation for Multi-Concept Customization in Training-Free Diffusion Models

Yang Yang, Wen Wang, Liang Peng, Chaotian Song, Yao Chen, Hengjia Li, Xiaolong Yang, Qinglin Lu, Deng Cai, Boxi Wu, Wei Liu

TL;DR

LoRA-Composer tackles multi-concept customization in diffusion models by enabling training-free fusion of multiple LoRAs through region-aware cross-attention, concept isolation in self-attention, and latent re-initialization to provide region priors. It directly addresses concept vanishing and concept confusion without reliance on image-based conditioning or fusion training. The approach demonstrates superior image-text fidelity and robust multi-concept generation across diverse subjects and styles, with ablations confirming the importance of each component. This yields a flexible, scalable method for composing complex scenes with multiple concepts using only textual/layout cues and lightweight LoRA modules.

Abstract

Customization generation techniques have significantly advanced the synthesis of specific concepts across varied contexts. Multi-concept customization is a challenging task within this domain. Existing approaches often rely on training a fusion matrix of multiple Low-Rank Adaptations (LoRAs) to merge various concepts into a single image. However, we find that this straightforward method faces two major challenges: 1) concept confusion, where the model struggles to preserve distinct individual characteristics, and 2) concept vanishing, where the model fails to generate the intended subjects. To address these issues, we introduce LoRA-Composer, a training-free framework designed for seamlessly integrating multiple LoRAs, thereby enhancing the harmony among different concepts within generated images. LoRA-Composer addresses concept vanishing through concept injection constraints, enhancing concept visibility via an expanded cross-attention mechanism. To combat concept confusion, concept isolation constraints are introduced, refining the self-attention computation. Furthermore, latent re-initialization is proposed to effectively stimulate concept-specific latents within designated regions. Our extensive testing shows that LoRA-Composer notably outperforms standard baselines, especially when image-based conditions such as Canny edges or pose estimations are removed. Code is released at https://github.com/Young98CN/LoRA_Composer
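
The concept isolation idea in the abstract (refining self-attention so that distinct concepts do not leak into each other) can be illustrated with a small masked-attention sketch. This is not the authors' implementation: the masking rule (a token inside one concept's box may not attend to tokens inside another concept's box), the non-overlapping boxes, and the helper names `box_to_token_mask`, `concept_isolation_mask`, and `masked_self_attention` are all assumptions made for illustration, written in plain NumPy rather than inside a diffusion U-Net.

```python
import numpy as np


def box_to_token_mask(box, height, width):
    """Flattened boolean mask of latent-grid tokens inside a layout box.

    `box` is (x0, y0, x1, y1) in grid cells with exclusive upper bounds.
    Hypothetical helper, not taken from the paper's code.
    """
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask.reshape(-1)


def concept_isolation_mask(boxes, height, width):
    """Build allowed[i, j]: may query token i attend to key token j?

    Assumed rule: tokens inside concept k's box are blocked from tokens
    inside any other concept's box. Background tokens are unrestricted.
    Boxes are assumed not to overlap.
    """
    n = height * width
    regions = [box_to_token_mask(b, height, width) for b in boxes]
    allowed = np.ones((n, n), dtype=bool)
    for k, region_k in enumerate(regions):
        for m, region_m in enumerate(regions):
            if m != k:
                # Queries in region k cannot see keys in region m.
                allowed[np.ix_(region_k, region_m)] = False
    return allowed


def masked_self_attention(q, k, v, allowed):
    """Scaled dot-product self-attention with a boolean attention mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(allowed, scores, -np.inf)  # blocked pairs get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

With two side-by-side boxes on a 4x4 latent grid, for example, the attention weights from any token in the left box to any token in the right box come out exactly zero, while each row of weights still sums to one.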

Paper Structure

This paper contains 26 sections, 11 equations, 13 figures, and 3 tables.

Figures (13)

  • Figure 1: Our method distinguishes itself from Mix-of-Show (Gu et al., 2023) by eliminating the image-based conditions and the requirement to train a LoRA fusion matrix. Furthermore, we highlight the limitations of Mix-of-Show through the demonstration of failure cases. In the top row, we illustrate two key issues: concept vanishing, marked by the absence of intended concepts in the image, and concept confusion, where the model mistakenly merges and confuses distinct concepts.
  • Figure 2: (a) LoRA-Composer utilizes textual, layout, and image-based conditions (optional) to integrate multiple LoRAs. (b) Modifications to the U-Net in LoRA-Composer Block include concept isolation in self-attention and concept injection in cross-attention. At timestep $t$, $z_t$ is first refined via $\mathcal{L}$ to ensure appearance consistency and prevent feature leakage, followed by the denoising process.
  • Figure 3: Modules of LoRA-Composer Block: (a) region-aware LoRA injection, (b) layout condition, (c) concept region mask; self-attention in the gray area is not calculated.
  • Figure 4: Three highlights of LoRA-Composer: (a) full image customization in multi-style; (b) manipulating interactions and attributes; (c) multi-condition control generation.
  • Figure 5: Qualitative comparison with baselines. For each case, we use the same seeds.
  • ...and 8 more figures