Cached Multi-Lora Composition for Multi-Concept Image Generation

Xiandong Zou, Mingzhu Shen, Christos-Savvas Bouganis, Yiren Zhao

TL;DR

The paper tackles the problem of composing multiple LoRAs for accurate multi-concept image generation. It introduces a Fourier-domain profiling approach to categorize LoRAs into high- and low-frequency groups and proposes a training-free Cached Multi-LoRA (CMLoRA) framework that sequences dominant LoRAs by frequency and caches non-dominant contributions to minimize semantic conflicts. Empirical results on the ComposLoRA testbed show CMLoRA achieving higher CLIPScore and MiniCPM-V-based win rates than state-of-the-art baselines, albeit with higher computational cost, and demonstrate the value of non-uniform caching and frequency-guided dominance scheduling. The work also proposes an automated evaluation pipeline via MiniCPM-V to assess element integration, spatial consistency, semantic accuracy, and aesthetic quality, providing a robust benchmark for multi-LoRA fusion. Overall, the approach offers a practical, training-free route to scalable, coherent multi-concept generation with broad implications for controllable image synthesis.

Abstract

Low-Rank Adaptation (LoRA) has emerged as a widely adopted technique in text-to-image models, enabling precise rendering of multiple distinct elements, such as characters and styles, in multi-concept image generation. However, current approaches face significant challenges when composing these LoRAs for multi-concept image generation, resulting in diminished image quality. In this paper, we first investigate the role of LoRAs in the denoising process through the lens of the Fourier frequency domain. Starting from the hypothesis that applying multiple LoRAs could lead to "semantic conflicts", we find that certain LoRAs amplify high-frequency features such as edges and textures, whereas others mainly focus on low-frequency elements, including the overall structure and smooth color gradients. Building on these insights, we devise a frequency-domain-based sequencing strategy to determine the optimal order in which LoRAs should be integrated during inference. This strategy offers a methodical and generalizable solution compared to the naive integration commonly found in existing LoRA fusion techniques. To fully leverage our proposed LoRA order sequence determination method in multi-LoRA composition tasks, we introduce a novel, training-free framework, Cached Multi-LoRA (CMLoRA), designed to efficiently integrate multiple LoRAs while maintaining cohesive image generation. With its flexible backbone for multi-LoRA fusion and a non-uniform caching strategy tailored to individual LoRAs, CMLoRA has the potential to reduce semantic conflicts in LoRA composition and improve computational efficiency. Our experimental evaluations demonstrate that CMLoRA outperforms state-of-the-art training-free LoRA fusion methods by a significant margin: it achieves an average improvement of $2.19\%$ in CLIPScore and $11.25\%$ in MLLM win rate compared to LoraHub, LoRA Composite, and LoRA Switch.
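
The frequency-domain view in the abstract can be made concrete with a small sketch. This excerpt does not define the paper's high-frequency measure, so the code below assumes one plausible reading: the mean Fourier amplitude of all components whose normalized radial frequency exceeds a cutoff (0.2 here, chosen to echo the subscript in $\mathcal{H}_{0.2}$). The function names `high_freq_amplitude` and `delta_high_freq` are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def high_freq_amplitude(image: np.ndarray, threshold: float = 0.2) -> float:
    """Mean Fourier amplitude above a radial cutoff (assumed definition)."""
    if image.ndim != 2:
        raise ValueError("expected a 2-D grayscale array")
    # Centered 2-D spectrum: the zero frequency moves to the middle.
    amplitude = np.abs(np.fft.fftshift(np.fft.fft2(image)))
    # Normalized frequency axes in cycles per sample, range [-0.5, 0.5).
    fy = np.fft.fftshift(np.fft.fftfreq(image.shape[0]))
    fx = np.fft.fftshift(np.fft.fftfreq(image.shape[1]))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    mask = radius > threshold
    if not mask.any():
        raise ValueError("image too small for this threshold")
    return float(amplitude[mask].mean())


def delta_high_freq(x_early: np.ndarray, x_late: np.ndarray,
                    threshold: float = 0.2) -> float:
    """Change in high-frequency amplitude between two denoising snapshots."""
    return (high_freq_amplitude(x_late, threshold)
            - high_freq_amplitude(x_early, threshold))


# Toy check: a flat image has no high-frequency energy, a checkerboard has a lot.
flat = np.ones((8, 8))
rows, cols = np.indices((8, 8))
checker = np.where((rows + cols) % 2 == 0, 1.0, -1.0)
gain = delta_high_freq(flat, checker)
```

Under this assumed measure, a LoRA whose outputs show a larger positive change over a fixed step interval would fall into the high-frequency group, and one with a small change into the low-frequency group.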

Paper Structure

This paper contains 49 sections, 17 equations, 19 figures, 9 tables.

Figures (19)

  • Figure 1: The denoising process with a Character LoRA and a Background LoRA. The plot shows, at each step $t$, the difference in the amplitude of high-frequency components, $\Delta\mathcal{H}_{0.2}\left(\overline{\mathbf{x}}_{t};40\right)$, over a $40$-step interval for images generated with the Character LoRA and the Background LoRA after the inverse Fourier transform.
  • Figure 2: Observation: Prompt-only generation (Naive) and existing LoRA combination methods (Merge and Switch) often lead to semantic conflicts. This failure primarily arises because independent LoRAs are integrated to contribute equally to image generation during the denoising process. CMLoRA employs a frequency-domain-based LoRA scheduling mechanism to integrate multiple concept LoRAs, effectively addressing semantic conflicts.
  • Figure 3: Summary of the change in amplitude of high-frequency components, $\overline{\Delta\mathcal{H}_{0.2}\left(\overline{\mathbf{x}}_{t};20\right)}$, during the denoising process for generated images with LoRAs across different LoRA categories.
  • Figure 4: Overview of our multi-LoRA composition framework during a $7$-step denoising process. Each color represents a distinct LoRA, where solid shapes indicate dominant LoRAs performing full inference, and hollow shapes represent non-dominant LoRAs leveraging the caching mechanism at their respective steps. The weight scale $w_{dom_{i}}$ on each dominant LoRA signifies its influence during the denoising process, where $w_{dom_{0}}=w_{dom_{1}}>\cdots>w_{dom_{5}}$.
  • Figure 5: Character LoRA and Background LoRA composition. Visual artifacts (green flowers) appear in the image generated by the LoRA Composite framework, as illustrated in the paper's decoding section (label sec:decoding). Introducing the caching mechanism can alleviate this semantic conflict.
  • ...and 14 more figures
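
The dominant/cached scheme in the Figure 4 caption can be illustrated with a toy schedule. This is a sketch under stated assumptions, not the authors' implementation: it assumes the dominant LoRA is chosen by cycling through a frequency-ordered list, that non-dominant LoRAs always reuse their last cached output, and that the remaining weight is split evenly among them. The paper's non-uniform caching rule is not given in this excerpt and is not reproduced here. Scalars stand in for LoRA outputs, and `compose_step_outputs` is a hypothetical name.

```python
from typing import Callable, Dict, List


def compose_step_outputs(
    lora_fns: Dict[str, Callable[[int], float]],
    order: List[str],
    num_steps: int,
    dom_weights: List[float],
) -> List[dict]:
    """Toy dominant/cached schedule; every rule here is an assumption."""
    if len(dom_weights) != num_steps:
        raise ValueError("need one dominant weight per step")
    cache: Dict[str, float] = {}
    trace = []
    for step in range(num_steps):
        # Assumed rule: cycle through the frequency-ordered LoRAs.
        dominant = order[step % len(order)]
        # The dominant LoRA runs full inference and refreshes its cache entry.
        cache[dominant] = lora_fns[dominant](step)
        others = [name for name in order if name != dominant]
        for name in others:
            # Non-dominant LoRAs reuse the cache; compute only on first use.
            if name not in cache:
                cache[name] = lora_fns[name](step)
        w = dom_weights[step]
        rest = (1.0 - w) / len(others) if others else 0.0
        combined = w * cache[dominant] + sum(rest * cache[n] for n in others)
        trace.append({"step": step, "dominant": dominant, "output": combined})
    return trace


# Usage: count how many full inferences each stand-in LoRA performs.
calls = {"character": 0, "background": 0}


def make_lora(name: str, value: float) -> Callable[[int], float]:
    def run(step: int) -> float:
        calls[name] += 1
        return value
    return run


fns = {"character": make_lora("character", 1.0),
       "background": make_lora("background", 3.0)}
trace = compose_step_outputs(fns, ["character", "background"], 4,
                             [0.8, 0.8, 0.6, 0.5])
```

In this toy run the two stand-in LoRAs perform five full inferences over four steps instead of eight, which is the kind of saving caching is meant to provide. The non-increasing weight list mirrors the ordering $w_{dom_{0}}=w_{dom_{1}}>\cdots>w_{dom_{5}}$ from the caption.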