FreeFuse: Multi-Subject LoRA Fusion via Auto Masking at Test Time

Yaoli Liu, Yao-Xiang Ding, Kun Zhou

TL;DR

FreeFuse tackles multi-subject generation in diffusion-based text-to-image models by deriving context-aware masks from cross-attention maps and applying them to LoRA outputs at test time. The approach requires no training, no modifications to LoRAs, no external segmentation models, and no user-provided region prompts. It mitigates inter-LoRA conflicts through attention-sink handling, self-attention locality, and a superpixel-based ensemble voting strategy. Evaluated against strong baselines, the method shows improvements in subject fidelity, prompt adherence, and image quality on challenging subject interactions. It enables practical, scalable multi-subject generation within standard diffusion workflows.

Abstract

This paper proposes FreeFuse, a novel training-free approach for multi-subject text-to-image generation through automatic fusion of multiple subject LoRAs. In contrast to existing methods that either focus on pre-inference LoRA weight merging or rely on segmentation models and complex techniques like noise blending to isolate LoRA outputs, our key insight is that context-aware dynamic subject masks can be automatically derived from cross-attention layer weights. Mathematical analysis shows that directly applying these masks to LoRA outputs during inference closely approximates the case where the subject LoRA is integrated into the diffusion model and used individually for the masked region. FreeFuse demonstrates superior practicality and efficiency, as it requires no additional training, no modification to LoRAs, no auxiliary models, and no user-defined prompt templates or region specifications. Instead, it only requires users to provide the LoRA activation words, allowing seamless integration into standard workflows. Extensive experiments validate that FreeFuse outperforms existing approaches in both generation quality and usability on multi-subject generation tasks. The project page is at https://future-item.github.io/FreeFuse/
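
To make the core idea concrete, the following is a minimal pure-Python sketch of masking LoRA outputs per image token. It is an illustration under stated assumptions, not the authors' implementation: all function names, shapes, and toy values are hypothetical, and the mask derivation is a plain argmax over the cross-attention each subject's activation-word tokens receive, which omits the attention-sink handling, self-attention locality, and superpixel voting that the paper describes.

```python
# Illustrative sketch only; names, shapes, and values are assumptions,
# not taken from the FreeFuse code base.

def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def lora_delta(x, down, up, scale=1.0):
    """Low-rank update scale * up @ (down @ x) for a single token."""
    return [scale * v for v in matvec(up, matvec(down, x))]


def masks_from_attention(attn, subject_token_ids):
    """Assign each image token to the subject whose activation-word tokens
    receive the most cross-attention (plain argmax; a simplification).

    attn[n][t] is the attention weight from image token n to text token t.
    Returns one 0/1 mask per subject, each of length len(attn).
    """
    masks = [[0.0] * len(attn) for _ in subject_token_ids]
    for n, row in enumerate(attn):
        scores = [sum(row[t] for t in ids) for ids in subject_token_ids]
        winner = max(range(len(scores)), key=scores.__getitem__)
        masks[winner][n] = 1.0
    return masks


def fused_linear(tokens, base_w, loras, masks):
    """Base layer output plus each LoRA's update, gated per token by its mask."""
    outputs = []
    for n, x in enumerate(tokens):
        y = matvec(base_w, x)
        for (down, up), mask in zip(loras, masks):
            delta = lora_delta(x, down, up)
            y = [yi + mask[n] * di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs


# Toy example: 2 image tokens, 2-dim features, two rank-1 LoRAs.
tokens = [[1.0, 0.0], [0.0, 1.0]]
base_w = [[1.0, 0.0], [0.0, 1.0]]          # identity base layer
lora_a = ([[1.0, 1.0]], [[10.0], [0.0]])   # (down: 1x2, up: 2x1)
lora_b = ([[1.0, 1.0]], [[0.0], [100.0]])
# Text tokens 0 and 1 stand in for the two activation words.
attn = [[0.7, 0.3], [0.2, 0.8]]

masks = masks_from_attention(attn, [[0], [1]])
fused = fused_linear(tokens, base_w, [lora_a, lora_b], masks)
```

In this toy setup, each image token receives the update of only the LoRA whose activation word it attends to most, so the two LoRAs never write to the same token. Without the masks, both updates would be summed at every token, which is the kind of inter-LoRA conflict the abstract refers to.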

Paper Structure

This paper contains 21 sections, 13 equations, 63 figures, 2 tables.

Figures (63)

  • Figure 1: This paper proposes FreeFuse, a highly practical method that requires no training, no modifications to existing LoRA models, no external models like segmentation models, and no user-defined prompt templates or region specifications, yet fully unlocks the capability of large DiT models to generate high-quality multi-subject interaction images.
  • Figure 2: An intuitive comparison of results for the prompt "harry-potter tucking a flower in daiyu-lin’s hair, both smiling warmly face-to-face". Our method, FreeFuse, demonstrates significant advantages in generating complex character interaction scenes.
  • Figure 3: Conflict Analysis
  • Figure 4: Left: Experiments show that removing LoRA from the feedforward (FF) and value (V) layers causes more significant semantic loss than removing it from other layers. Right: We randomly downloaded 45 FLUX-based LoRAs from Civitai and sampled 225 images. Results show that disabling the FF or V layers causes a large increase in L2 loss, while other layers have limited effect, indicating that semantic information is primarily injected through the V and FF layers.
  • Figure 5: Pipeline. Our pipeline consists of two stages: the first derives subject masks from attention maps, and the second applies these masks to LoRA outputs, ensuring that each LoRA only operates within its corresponding subject region.
  • ...and 58 more figures