MultLFG: Training-free Multi-LoRA composition using Frequency-domain Guidance

Aniket Roy, Maitreya Suin, Ketul Shah, Rama Chellappa

TL;DR

MultLFG tackles the challenge of training-free multi-LoRA composition by introducing frequency-guided, adaptive fusion in the wavelet domain. By decomposing latent and image representations into multi-scale frequency subbands and applying timestep-aware, adaptive weights to top-k LoRAs per subband, it reduces concept interference and improves compositional fidelity. The approach achieves consistent gains on the ComposLoRA benchmark over multiple baselines, validated with CLIP-based metrics, GPT-4V evaluations, and human studies, while detailing ablations that underline the contribution of frequency guidance and adaptive merging. This work offers a practical, training-free pathway to more controllable and reliable multi-concept image synthesis in diffusion models.

Abstract

Low-Rank Adaptation (LoRA) has gained prominence as a computationally efficient method for fine-tuning generative models, enabling distinct visual concept synthesis with minimal overhead. However, current methods struggle to effectively merge multiple LoRA adapters without training, particularly in complex compositions involving diverse visual elements. We introduce MultLFG, a novel framework for training-free multi-LoRA composition that utilizes frequency-domain guidance to achieve adaptive fusion of multiple LoRAs. Unlike existing methods that uniformly aggregate concept-specific LoRAs, MultLFG employs a timestep and frequency subband adaptive fusion strategy, selectively activating relevant LoRAs based on content relevance at specific timesteps and frequency bands. This frequency-sensitive guidance not only improves spatial coherence but also provides finer control over multi-LoRA composition, leading to more accurate and consistent results. Experimental evaluations on the ComposLoRA benchmark reveal that MultLFG substantially enhances compositional fidelity and image quality across various styles and concept sets, outperforming state-of-the-art baselines in multi-concept generation tasks. Code will be released.

Paper Structure

This paper contains 15 sections, 1 theorem, 13 equations, 5 figures, 7 tables, 1 algorithm.

Key Result

Lemma 1

Let $x$ represent an image composed of multiple concepts $\{c_i\}_{i=1}^n$. Consider wavelet decomposition of the image into frequency-specific subbands $b \in \{LL, LH, HL, HH\}$: $x = \sum_{b \in \{LL,LH,HL,HH\}} x_b.$ Define interference between concepts $c_i$ and $c_j$ within a subband $b$ as: Then, interference in the frequency domain is strictly lower than the spatial domain interference:
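The decomposition $x = \sum_{b} x_b$ in the lemma can be checked numerically. The sketch below is illustrative only: it assumes a single-level orthonormal Haar wavelet (the wavelet is not named here), and the LH/HL naming follows one common convention. The helper names `haar_dwt2`, `subband_image`, and `sum_subbands` are hypothetical. Each $x_b$ is obtained by inverting the transform with all other subbands set to zero.

```python
# Single-level 2D Haar transform in pure Python (assumed wavelet, for illustration).
# Sign pattern of each subband over the 2x2 block pixels (a, b, c, d) =
# (top-left, top-right, bottom-left, bottom-right). LH/HL naming varies by convention.
SIGNS = {
    "LL": (1, 1, 1, 1),
    "LH": (1, -1, 1, -1),
    "HL": (1, 1, -1, -1),
    "HH": (1, -1, -1, 1),
}

def haar_dwt2(x):
    """Decompose an even-sized 2D list into LL, LH, HL, HH coefficient grids."""
    h, w = len(x), len(x[0])
    bands = {name: [[0.0] * (w // 2) for _ in range(h // 2)] for name in SIGNS}
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            block = (x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
            for name, signs in SIGNS.items():
                # Orthonormal Haar: signed sum of the block divided by 2.
                bands[name][i // 2][j // 2] = sum(s * v for s, v in zip(signs, block)) / 2
    return bands

def subband_image(bands, name):
    """Inverse transform keeping only one subband: the image-domain term x_b."""
    coeffs = bands[name]
    sa, sb, sc, sd = SIGNS[name]
    h2, w2 = len(coeffs), len(coeffs[0])
    out = [[0.0] * (2 * w2) for _ in range(2 * h2)]
    for i in range(h2):
        for j in range(w2):
            v = coeffs[i][j] / 2
            out[2 * i][2 * j] = sa * v
            out[2 * i][2 * j + 1] = sb * v
            out[2 * i + 1][2 * j] = sc * v
            out[2 * i + 1][2 * j + 1] = sd * v
    return out

def sum_subbands(bands):
    """Add the four subband images; by linearity this recovers the input."""
    parts = [subband_image(bands, name) for name in SIGNS]
    h, w = len(parts[0]), len(parts[0][0])
    return [[sum(p[i][j] for p in parts) for j in range(w)] for i in range(h)]

x = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0],
     [13.0, 14.0, 15.0, 16.0]]
bands = haar_dwt2(x)
recon = sum_subbands(bands)
```

Because the transform is linear and orthonormal, `recon` matches `x` up to floating-point error, which is the identity the lemma starts from.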

Figures (5)

  • Figure 1: Existing LoRA composition methods (composite, switch zhong2024multi) generally apply an equal, uniform contribution from each concept LoRA across denoising timesteps, which causes concept mixing or erasure. MultLFG performs multi-LoRA composition using wavelet-based frequency guidance and adaptive fusion, providing better compositionality.
  • Figure 2: Frequency analysis. Low-frequency components (LL) are prominent in early timesteps (T=200), whereas high-frequency components (HL, LH, HH) are more prominent in later timesteps (T=0) during denoising.
  • Figure 3: Overview of MultLFG. (1) Per-LoRA noise is predicted from the current and previous timesteps. (2) DWT is performed on the denoised image and latent. (3) Temporal differences across consecutive timesteps are calculated and then normalized by concept area. (4) Changes in images are scaled to changes in the latent. (5) Adaptive weights are computed based on the importance of the top-k LoRAs. (6) These weights guide the weighted multi-LoRA composition in the wavelet domain for frequency-based guidance. The final image is generated by applying the IDWT and VAE decoding.
  • Figure 4: Comparison of multi-LoRA composition for realistic images. Baselines show concept mixing (the tie color is exchanged with the skirt color in the 4th and 5th rows) or concept vanishing (the tie/dress is missing in the 4th row), whereas MultLFG (last column) minimizes concept mixing and vanishing while maintaining quality.
  • Figure 5: Comparison for anime images. Baselines show implausible concept composition (the burger is floating in the 2nd row) or concept erasure (a different dress in the 3rd row; either the dress or the background is inconsistent in the 4th row). MultLFG (last column) combines concepts better while maintaining quality.
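
Steps (5) and (6) of the Figure 3 caption, top-k selection followed by weighted composition within one subband, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's formula: the softmax normalization, the `temperature` parameter, and the function names `topk_softmax_weights` and `fuse_subband` are all hypothetical, and the importance scores are taken as given.

```python
import math

def topk_softmax_weights(scores, k, temperature=1.0):
    """One weight per LoRA: softmax over the k highest scores, zero elsewhere.

    Softmax is an assumed normalization chosen for illustration only.
    """
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of LoRAs")
    # Indices of the k most important LoRAs for this subband and timestep.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp((scores[i] - peak) / temperature) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]

def fuse_subband(predictions, weights):
    """Weighted sum of per-LoRA coefficient lists for a single subband."""
    n = len(predictions[0])
    return [sum(w * p[t] for w, p in zip(weights, predictions)) for t in range(n)]

# Four hypothetical LoRAs, importance scores for one subband at one timestep.
scores = [0.1, 2.0, 2.0, -1.0]
weights = topk_softmax_weights(scores, k=2)
# Toy per-LoRA subband coefficients (flattened to short lists).
predictions = [[1.0, 1.0], [2.0, 4.0], [4.0, 8.0], [9.0, 9.0]]
fused = fuse_subband(predictions, weights)
```

In a full pipeline this fusion would be repeated for each subband and timestep before the inverse transform; LoRAs outside the top-k receive zero weight and so do not contribute to that subband.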

Theorems & Definitions (1)

  • Lemma 1: Frequency decomposition reduces interference