FaR: Enhancing Multi-Concept Text-to-Image Diffusion via Concept Fusion and Localized Refinement

Gia-Nghia Tran, Quang-Huy Che, Trong-Tai Dam Vu, Bich-Nga Pham, Vinh-Tiep Nguyen, Trung-Nghia Le, Minh-Triet Tran

TL;DR

FaR addresses overfitting and attribute leakage in multi-concept text-to-image personalization by combining Concept Fusion data augmentation and Localized Refinement loss. Concept Fusion augments training data by separating subjects from backgrounds, generating priors with class-name diffusion, and composing augmented references and priors into training sets. Localized Refinement aligns each concept's cross-attention to its designated region, reducing leakage between similar subjects and enabling precise multi-subject composition. Empirical results on 24 concepts show FaR outperforms state-of-the-art methods in fidelity and photorealism while maintaining efficiency during inference.
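The compositing step of Concept Fusion can be illustrated with a minimal NumPy sketch. It assumes the subjects have already been separated from their backgrounds, so each subject arrives as a cutout plus a boolean mask. The function name `fuse_concepts`, the blank canvas, and the overlap rule (later subjects overwrite earlier ones) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def fuse_concepts(cutouts, masks, canvas_hw, rng):
    """Paste each subject cutout at a random position on a blank canvas.

    cutouts: list of (h, w, 3) float arrays, one per subject
    masks:   list of (h, w) boolean arrays marking subject pixels
    Returns the composite image and an integer region map in which the
    value i + 1 marks pixels occupied by subject i (0 = background).
    Assumption: where subjects overlap, the later one overwrites the earlier.
    """
    H, W = canvas_hw
    composite = np.zeros((H, W, 3), dtype=np.float32)
    region = np.zeros((H, W), dtype=np.int32)
    for i, (img, m) in enumerate(zip(cutouts, masks)):
        h, w = m.shape
        # Random top-left corner such that the cutout fits inside the canvas.
        top = rng.integers(0, H - h + 1)
        left = rng.integers(0, W - w + 1)
        # Basic slices are views, so these writes modify the canvas in place.
        composite[top:top + h, left:left + w][m] = img[m]
        region[top:top + h, left:left + w][m] = i + 1
    return composite, region
```

The region map returned alongside the composite is the kind of per-subject location information that a region-based attention loss could consume during training.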

Abstract

Generating multiple new concepts remains a challenging problem in text-to-image generation. Current methods often overfit when trained on a small number of samples and struggle with attribute leakage, particularly for class-similar subjects (e.g., two specific dogs). In this paper, we introduce Fuse-and-Refine (FaR), a novel approach that tackles these challenges through two key contributions: the Concept Fusion technique and the Localized Refinement loss function. Concept Fusion systematically augments the training data by separating reference subjects from backgrounds and recombining them into composite images to increase diversity. This augmentation tackles overfitting by broadening the narrow distribution of the limited training samples. In addition, the Localized Refinement loss function preserves each subject's representative attributes by aligning each concept's attention map to its correct region. This prevents attribute leakage by ensuring that the diffusion model distinguishes similar subjects without mixing their attention maps during the denoising process. By fine-tuning specific modules jointly, FaR balances the learning of new concepts with the retention of previously learned knowledge. Empirical results show that FaR prevents overfitting and attribute leakage while maintaining photorealism, and that it outperforms other state-of-the-art methods.
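The idea of aligning each concept's attention map to its region can be sketched as a simple penalty. The abstract does not state the loss formula, so the mean-squared-error form below, the max-normalization, and the names `localized_alignment_loss`, `token_ids`, and `masks` are all assumptions made for illustration, not the paper's definition.

```python
import numpy as np

def localized_alignment_loss(attn, token_ids, masks):
    """Illustrative penalty: MSE between a concept token's attention and its mask.

    attn:      (H*W, T) cross-attention weights (rows = image positions)
    token_ids: list of token column indices, one per concept (hypothetical)
    masks:     (k, H*W) binary array; masks[i] is the region of concept i
    """
    loss = 0.0
    for i, t in enumerate(token_ids):
        a = attn[:, t]
        # Rescale to roughly [0, 1] so the map is comparable to a binary mask.
        a = a / (a.max() + 1e-8)
        loss += np.mean((a - masks[i]) ** 2)
    return loss / len(token_ids)
```

Under this sketch, attention concentrated inside the assigned region gives a loss near zero, while attention that falls on another subject's region (the leakage case) is penalized.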

Paper Structure

This paper contains 17 sections, 4 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Illustration of Single-head Cross-Attention in Stable Diffusion. The image $x$ is processed through encoder $\mathcal{E}$, generating a latent representation $z$. A text prompt $p$ is encoded into a text embedding $\tau(p)$. $W^q$, $W^k$, and $W^v$ map the inputs to a query $Q$, key $K$, and value $V$ feature, respectively. The cross-attention map $A$ is multiplied by $V$ to generate features that capture the interaction between image and text.
  • Figure 2: Overview of Concept Fusion. By separating each subject from the background and randomly positioning it on new composite samples, the Concept Fusion augmentation technique enhances the model's ability to differentiate between identities.
  • Figure 3: Our training pipeline is demonstrated using a subset of $k = 2$ subjects. For simplicity, we set the subject IDs as $C_1 = 1$ and $C_2 = 2$. During training, we simultaneously optimize the text encoder, self-attention layers, and cross-attention layers. This approach enables the model to learn detailed features of the new concepts while minimizing the loss of knowledge from the original model.
  • Figure 4: Our dataset of 24 subjects across humans, animals, and objects was used to evaluate personalization methods.
  • Figure 5: Qualitative Comparison of Single-Concept Generation. Our approach (last column) outperforms others by generating visually consistent, contextually accurate representations while preserving target context and reference appearance.
  • ...and 2 more figures
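The single-head cross-attention described in the Figure 1 caption can be sketched in NumPy as follows. This is the standard scaled dot-product form, with queries computed from the latent $z$ and keys and values computed from the text embedding $\tau(p)$. The array shapes, the flattening of the latent into a sequence of positions, and the function name `cross_attention` are assumptions for illustration.

```python
import numpy as np

def cross_attention(z, text_emb, Wq, Wk, Wv):
    """Single-head cross-attention between image latents and text tokens.

    z:        (N, d_img)  flattened latent features, one row per position
    text_emb: (T, d_txt)  text embedding tau(p), one row per token
    Wq:       (d_img, d)  query projection
    Wk, Wv:   (d_txt, d)  key and value projections
    Returns the attended features (N, d) and the attention map A (N, T).
    """
    Q = z @ Wq
    K = text_emb @ Wk
    V = text_emb @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V, A
```

Each row of `A` is a distribution over text tokens for one image position, so a column of `A` reshaped to the latent's spatial size gives the per-token attention map that region-based losses operate on.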