K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs

Ziheng Ouyang, Zhen Li, Qibin Hou

TL;DR

K-LoRA addresses the challenge of training-free fusion of subject and style LoRAs in diffusion-based image generation. It introduces a Top-K, step-aware attention fusion mechanism that selectively combines content and style LoRAs per layer, guided by the sums of the dominant Top-K elements and diffusion-time scaling, with a gamma-balanced factor to harmonize sources. The approach achieves superior quantitative alignment (StyleSim, CLIP, DINO) and robust qualitative results compared to training-based baselines, while requiring no retraining. This method enables precise stylized content fusion across diverse styles and subjects, with practical applicability to existing LoRA weights and diffusion models.

Abstract

Recent studies have explored combining different LoRAs to jointly generate learned style and content. However, existing methods either fail to effectively preserve both the original subject and style simultaneously or require additional training. In this paper, we argue that the intrinsic properties of LoRA can effectively guide diffusion models in merging learned subject and style. Building on this insight, we propose K-LoRA, a simple yet effective training-free LoRA fusion approach. In each attention layer, K-LoRA compares the Top-K elements in each LoRA to be fused, determining which LoRA to select for optimal fusion. This selection mechanism ensures that the most representative features of both subject and style are retained during the fusion process, effectively balancing their contributions. Experimental results demonstrate that the proposed method effectively integrates the subject and style information learned by the original LoRAs, outperforming state-of-the-art training-based approaches in both qualitative and quantitative results.
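
The per-layer selection described in the abstract can be sketched roughly as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: it assumes each LoRA update is available as a dense matrix, scores it by the sum of its k largest absolute entries, and keeps whichever update scores higher. The function names, the use of absolute values, and the `style_scale` parameter (a stand-in for the timestep scaling and gamma factor mentioned in the TL;DR, whose exact form is not given here) are hypothetical.

```python
import numpy as np


def topk_abs_sum(delta_w, k):
    """Sum of the k largest absolute entries of a LoRA update matrix."""
    flat = np.abs(np.asarray(delta_w, dtype=float)).ravel()
    k = max(1, min(k, flat.size))  # clamp k to a valid range
    kth = flat.size - k
    # np.partition leaves the k largest values (unordered) at indices >= kth.
    return float(np.partition(flat, kth)[kth:].sum())


def select_lora(delta_content, delta_style, k, style_scale=1.0):
    """Pick one LoRA update for a layer by comparing Top-K sums.

    style_scale is a hypothetical placeholder for the step-dependent
    scaling; the exact schedule is an assumption, not taken from the text.
    """
    score_content = topk_abs_sum(delta_content, k)
    score_style = topk_abs_sum(delta_style, k) * style_scale
    if score_content >= score_style:
        return "content", delta_content
    return "style", delta_style


# Toy updates for a single attention layer (illustrative values only).
content_update = np.array([[1.0, -5.0], [0.5, 2.0]])
style_update = np.array([[3.0, 3.0], [0.0, 0.0]])

chosen, _ = select_lora(content_update, style_update, k=2)
print(chosen)
```

In practice each update would be formed from the low-rank factors of the corresponding LoRA, and the comparison would be repeated for every attention layer at every diffusion step, so the chosen source can change from layer to layer and step to step.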

Paper Structure

This paper contains 16 sections, 11 equations, 18 figures, 2 tables, 1 algorithm.

Figures (18)

  • Figure 1: Visual results of findings. (a) Results generated using only the content LoRA. (b) Results generated using only the style LoRA. In these experiments, we test the differences between adding the LoRA layers in the initial and later timesteps.
  • Figure 2: Experimental visualization results. (a) Generated images by randomly loading a portion of the LoRA attention layers according to a certain ratio. (b) Visualization of LoRA data distribution from different sources: one trained locally and the other one downloaded from a community repository.
  • Figure 3: Overview of the proposed K-LoRA. We propose K-LoRA, which utilizes the Top-K function to select the important LoRA weights in each forward layer based on the sum of matrix elements. This enables us to preserve both stylistic details and object features.
  • Figure 4: Ratio visualization. This image shows the ratios obtained after summing the Top-K elements; the ratio differences at each corresponding position are quite significant.
  • Figure 5: LoRA selection during the generation process. This figure illustrates the selection within each forward layer. The vertical axis represents the 50 diffusion steps, while the horizontal axis indicates the LoRA layer index. The color at each position denotes the selected LoRA. Blue bars correspond to objects, and green bars correspond to styles.
  • ...and 13 more figures