Multi-LoRA Composition for Image Generation

Ming Zhong, Yelong Shen, Shuohang Wang, Yadong Lu, Yizhu Jiao, Siru Ouyang, Donghan Yu, Jiawei Han, Weizhu Chen

TL;DR

This work tackles the challenge of composing multiple LoRAs in diffusion-based image generation by adopting a decoding-centric view. It introduces two training-free methods, LoRA Switch and LoRA Composite, that operate during the denoising process without altering LoRA weights, and validates them on the ComposLoRA testbed with GPT-4V and human evaluations. The results show clear improvements over traditional LoRA merging, especially as the number of LoRAs grows, along with an analysis of style, activation order, and potential evaluator bias. The study provides practical guidelines for multi-LoRA composition and establishes a benchmark suite for future research in this domain.

Abstract

Low-Rank Adaptation (LoRA) is extensively utilized in text-to-image models for the accurate rendition of specific elements like distinct characters or unique styles in generated images. Nonetheless, existing methods face challenges in effectively composing multiple LoRAs, especially as the number of LoRAs to be integrated grows, thus hindering the creation of complex imagery. In this paper, we study multi-LoRA composition through a decoding-centric perspective. We present two training-free methods: LoRA Switch, which alternates between different LoRAs at each denoising step, and LoRA Composite, which simultaneously incorporates all LoRAs to guide more cohesive image synthesis. To evaluate the proposed approaches, we establish ComposLoRA, a new comprehensive testbed as part of this research. It features a diverse range of LoRA categories with 480 composition sets. Utilizing an evaluation framework based on GPT-4V, our findings demonstrate a clear improvement in performance with our methods over the prevalent baseline, particularly evident when increasing the number of LoRAs in a composition. The code, benchmarks, LoRA weights, and all evaluation details are available on our project website: https://maszhongming.github.io/Multi-LoRA-Composition.
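
The difference between the two decoding-time methods described above can be illustrated with a toy denoising loop. This is a minimal sketch, not the authors' implementation: `make_toy_predictor`, `switch_predictor`, `composite_prediction`, and `denoise` are hypothetical names, each "LoRA" is stood in for by a simple function that returns a noise vector, the update rule is a plain gradient-style step rather than a real diffusion scheduler, and the equal-weight average in the composite branch is an assumption about how the LoRAs jointly guide a step.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical stand-in for "base model with one LoRA active":
# maps the current latent to a predicted noise vector.
NoisePredictor = Callable[[Vector], Vector]


def make_toy_predictor(target: Vector) -> NoisePredictor:
    """Toy predictor: the 'noise' is the offset of the latent from a target."""
    def predict(latent: Vector) -> Vector:
        return [x - t for x, t in zip(latent, target)]
    return predict


def switch_predictor(predictors: Sequence[NoisePredictor], step: int) -> NoisePredictor:
    """LoRA Switch (sketch): exactly one LoRA is active per step, in rotation."""
    return predictors[step % len(predictors)]


def composite_prediction(predictors: Sequence[NoisePredictor], latent: Vector) -> Vector:
    """LoRA Composite (sketch): every LoRA predicts, then predictions are averaged.

    Equal weighting is an assumption made for this illustration.
    """
    preds = [p(latent) for p in predictors]
    return [sum(vals) / len(preds) for vals in zip(*preds)]


def denoise(predictors: Sequence[NoisePredictor], latent: Vector,
            num_steps: int, mode: str, step_size: float = 0.1) -> Vector:
    """Toy denoising loop; LoRA weights are never modified in either mode."""
    for step in range(num_steps):
        if mode == "switch":
            noise = switch_predictor(predictors, step)(latent)
        elif mode == "composite":
            noise = composite_prediction(predictors, latent)
        else:
            raise ValueError(f"unknown mode: {mode}")
        # Simplified update, not a real diffusion scheduler step.
        latent = [x - step_size * n for x, n in zip(latent, noise)]
    return latent


if __name__ == "__main__":
    loras = [make_toy_predictor([1.0, 0.0]), make_toy_predictor([0.0, 1.0])]
    print(denoise(loras, [0.0, 0.0], num_steps=20, mode="switch"))
    print(denoise(loras, [0.0, 0.0], num_steps=20, mode="composite"))
```

In both modes the per-LoRA predictors stay untouched and only the per-step guidance changes, which is what distinguishes these methods from merging LoRA weights before generation.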

Paper Structure (34 sections, 5 equations, 10 figures, 9 tables)

Figures (10)

  • Figure 1: Multi-LoRA composition techniques effectively blend different elements such as characters, clothing, and objects into a cohesive image. Unlike the conventional LoRA Merge approach (Ryu, 2023), which can lead to detail loss and image distortion as more LoRAs are added, our methods retain the accuracy of each element and the overall image quality.
  • Figure 2: Overview of three multi-LoRA composition techniques, where each colored LoRA represents a distinct element. The prevalent approach, LoRA Merge, linearly merges multiple LoRAs into a single one. In contrast, our methods concentrate on the denoising process: LoRA Switch cycles through different LoRAs during the denoising, while LoRA Composite involves all LoRAs working together as the guidance throughout the generation process.
  • Figure 3: Results of comparative evaluation on ComposLoRA using GPT-4V.
  • Figure 4: Analysis of image styles. In general, LoRA Switch (LoRA-s) is more adept at realistic styles, while LoRA Composite (LoRA-c) performs better in anime styles.
  • Figure 5: Analysis of the number of denoising steps between LoRA switches and of the activation order for LoRA Switch. In the activation-order panel, "Character" indicates that the character LoRA is activated first, with the rest being activated randomly.
  • ...and 5 more figures
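
The two LoRA Switch settings analyzed in Figure 5, the number of denoising steps between switches and the activation order, can be sketched as a small scheduling helper. This is an illustrative sketch only: `activation_order`, `active_lora`, the parameter name `tau`, and the example category names are hypothetical and not taken from the paper's code.

```python
import random
from typing import List, Optional


def activation_order(loras: List[str], first: Optional[str] = None,
                     seed: int = 0) -> List[str]:
    """Put `first` (e.g. the character LoRA) at the front and shuffle the rest."""
    rng = random.Random(seed)  # seeded so the order is reproducible
    rest = [name for name in loras if name != first]
    rng.shuffle(rest)
    return ([first] if first in loras else []) + rest


def active_lora(order: List[str], step: int, tau: int) -> str:
    """Return the LoRA active at `step` when switching every `tau` steps."""
    if tau < 1:
        raise ValueError("tau must be at least 1")
    return order[(step // tau) % len(order)]


if __name__ == "__main__":
    order = activation_order(["style", "character", "clothing"], first="character")
    schedule = [active_lora(order, step, tau=5) for step in range(30)]
    print(order)
    print(schedule)
```

With `tau=1` this reduces to switching at every denoising step; larger values keep each LoRA active for a longer stretch before moving to the next one in the order.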