AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization

Junjie Shentu; Matthew Watson; Noura Al Moubayed

TL;DR

AttenCraft tackles the challenge of disentangling multiple concepts in text-to-image customization with diffusion models by generating per-concept masks from attention maps in a single initialization step, without external segmentation masks. It introduces an attention-driven adaptive sampling ratio based on cross-attention scores to synchronize learning across concepts, and a feature-retaining training framework that prevents feature fusion when learning multiple concepts. The approach combines mask-based disentanglement with a carefully designed loss strategy, enabling robust multi-concept learning across 16 datasets and generalizing to scenarios with more than two concepts, while achieving state-of-the-art image fidelity and competitive prompt fidelity. This work advances practical subject-driven T2I customization by reducing reliance on human-provided masks and improving stability and quality in multi-concept learning.

Abstract

Text-to-image (T2I) customization empowers users to adapt the T2I diffusion model to new concepts absent in the pre-training dataset. On this basis, capturing multiple new concepts from a single image has emerged as a new task, allowing the model to learn multiple concepts simultaneously or discard unwanted concepts. However, multiple-concept disentanglement remains a key challenge. Existing disentanglement models often exhibit two main issues: feature fusion and asynchronous learning across different concepts. To address these issues, we propose AttenCraft, an attention-based method for multiple-concept disentanglement. Our method uses attention maps to generate accurate masks for each concept in a single initialization step, aiding in concept disentanglement without requiring mask preparation from humans or specialized models. Moreover, we introduce an adaptive algorithm based on attention scores to estimate sampling ratios for different concepts, promoting balanced feature acquisition and synchronized learning. AttenCraft also introduces a feature-retaining training framework that employs various loss functions to enhance feature recognition and prevent fusion. Extensive experiments show that our model effectively mitigates these two issues, achieving state-of-the-art image fidelity and comparable prompt fidelity to baseline models.
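The two attention-based steps named in the abstract, per-concept mask creation and sampling-ratio estimation, can be illustrated with a minimal NumPy sketch. This is not the authors' algorithm: the abstract does not specify either procedure, so everything below is an assumption made for illustration. The function names `masks_from_attention` and `sampling_ratios`, the min-max normalization, the `threshold` value, and the choice to sample weakly attended concepts more often (ratios inversely proportional to attention score) are all hypothetical.

```python
import numpy as np


def masks_from_attention(cross_attn, threshold=0.5, eps=1e-8):
    """Hypothetical mask creation from per-concept cross-attention maps.

    cross_attn: array of shape (K, H, W), one non-negative map per concept token.
    Returns a boolean array of shape (K, H, W). A pixel belongs to concept k
    when k has the highest normalized attention there and that value is at
    least `threshold` (both rules are assumptions, not the paper's method).
    """
    attn = np.asarray(cross_attn, dtype=float)
    # Min-max normalize each concept's map to [0, 1] so maps are comparable.
    lo = attn.min(axis=(1, 2), keepdims=True)
    hi = attn.max(axis=(1, 2), keepdims=True)
    norm = (attn - lo) / np.maximum(hi - lo, eps)
    # Assign each pixel to the concept with the strongest response.
    winner = norm.argmax(axis=0)  # shape (H, W)
    concept_ids = np.arange(attn.shape[0])[:, None, None]
    return (winner[None] == concept_ids) & (norm >= threshold)


def sampling_ratios(attention_scores, eps=1e-8):
    """Hypothetical sampling ratios from per-concept attention scores.

    Assumption for illustration only: a concept with a lower attention score
    is sampled more often, so ratios are proportional to 1 / score and are
    normalized to sum to 1.
    """
    scores = np.asarray(attention_scores, dtype=float)
    inverse = 1.0 / (scores + eps)
    return inverse / inverse.sum()
```

Under these assumptions the masks are mutually exclusive by construction, which mirrors the stated goal of keeping concepts from fusing, and the ratios always sum to one so they can be used directly as sampling probabilities.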

Paper Structure (30 sections, 9 equations, 15 figures, 5 tables)

Introduction
Related Work
Diffusion models and T2I customization
Application of attention in diffusion models
Disentangling multiple concepts from a single image
Proposed Method
Preliminary
Attention-guided mask creation
Adaptive sampling ratio estimation based on attention scores
Identifier token initialization
Attention activation and sampling ratio
Adaptive sampling ratio estimation
Feature-retaining training framework
Experiments
Experimental settings
...and 15 more sections

Figures (15)

Figure 1: We propose AttenCraft, an optimized method for disentangling multiple concepts in a single image. Baseline models present two key issues: (a) feature fusion; (b) asynchronous learning. Our method significantly mitigates these issues and realizes robust concept disentanglement and feature learning.
Figure 2: Method overview. Given an image with multiple concepts, within a few steps in the pre-processing stage, we create accurate masks for each concept and adaptively estimate the sampling ratio for multiple concepts to enhance learning synchronicity. We also propose an optimized training framework by introducing different loss functions for sampled subsets of varying sizes to prevent feature fusion.
Figure 3: Process of attention-guided mask creation. By applying the cross-attention and self-attention maps, precise masks can be created without specialized models or human inputs.
Figure 4: Results of the token initialization experiment. (a) Single-concept CLIP-I scores as a function of training step; (b) the highest cross-attention score of $\rm [V]$ under different initialization patterns.
Figure 5: Qualitative results for concept disentanglement and feature fusion. CusDiff cannot disentangle multiple concepts, and both DisenDiff and BAS exhibit feature fusion. Our method not only disentangles the target concepts but also mitigates the feature fusion problem.
...and 10 more figures
