AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization
Junjie Shentu, Matthew Watson, Noura Al Moubayed
TL;DR
AttenCraft tackles the challenge of disentangling multiple concepts in text-to-image customization with diffusion models by generating per-concept masks from attention maps in a single initialization step, without external segmentation masks. It introduces an attention-driven adaptive sampling ratio based on cross-attention scores to synchronize learning across concepts, and a feature-retaining training framework that prevents feature fusion when learning multiple concepts. The approach combines mask-based disentanglement with a carefully designed loss strategy, enabling robust multi-concept learning across 16 datasets and generalizing to scenarios with more than two concepts, while achieving state-of-the-art image fidelity and competitive prompt fidelity. This work advances practical subject-driven T2I customization by reducing reliance on human-provided masks and improving stability and quality in multi-concept learning.
Abstract
Text-to-image (T2I) customization empowers users to adapt a T2I diffusion model to new concepts absent from the pre-training dataset. Building on this, capturing multiple new concepts from a single image has emerged as a new task, allowing the model to learn multiple concepts simultaneously or discard unwanted ones. However, multiple-concept disentanglement remains a key challenge. Existing disentanglement models often exhibit two main issues: feature fusion and asynchronous learning across different concepts. To address these issues, we propose AttenCraft, an attention-based method for multiple-concept disentanglement. Our method uses attention maps to generate accurate masks for each concept in a single initialization step, aiding concept disentanglement without requiring masks prepared by humans or specialized models. Moreover, we introduce an adaptive algorithm based on attention scores to estimate sampling ratios for different concepts, promoting balanced feature acquisition and synchronized learning. AttenCraft also introduces a feature-retaining training framework that employs various loss functions to enhance feature recognition and prevent fusion. Extensive experiments show that our model effectively mitigates these two issues, achieving state-of-the-art image fidelity and prompt fidelity comparable to that of baseline models.
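
To make the two attention-based ideas above concrete, the following is a minimal sketch, not the paper's implementation. It assumes each concept token already has one aggregated cross-attention map of shape (H, W). The function names, the min-max normalization, the 0.5 threshold, and the inverse-score rule for the sampling ratios are all illustrative assumptions; the abstract does not specify these details.

```python
import numpy as np

def masks_from_attention(attn_maps, threshold=0.5):
    """Turn per-concept attention maps of shape (K, H, W) into disjoint boolean masks.

    Assumption: a pixel belongs to the concept whose normalized attention is
    highest there, provided that value reaches `threshold`.
    """
    attn = np.asarray(attn_maps, dtype=np.float64)
    # Min-max normalize each concept's map to [0, 1] so maps are comparable.
    lo = attn.min(axis=(1, 2), keepdims=True)
    hi = attn.max(axis=(1, 2), keepdims=True)
    norm = (attn - lo) / np.maximum(hi - lo, 1e-12)
    # Index of the strongest concept at every pixel.
    winner = norm.argmax(axis=0)
    return np.stack([(winner == k) & (norm[k] >= threshold)
                     for k in range(attn.shape[0])])

def sampling_ratios(attn_maps, masks):
    """Hypothetical adaptive rule: sample weakly attended concepts more often.

    The score of a concept is its mean raw attention inside its own mask;
    ratios are proportional to the inverse score and sum to 1.
    """
    attn = np.asarray(attn_maps, dtype=np.float64)
    scores = np.array([attn[k][masks[k]].mean() if masks[k].any() else 0.0
                       for k in range(attn.shape[0])])
    inv = 1.0 / np.maximum(scores, 1e-8)
    return inv / inv.sum()

if __name__ == "__main__":
    # Toy example: concept 0 dominates the left half, concept 1 the right half.
    a0 = np.tile([0.9, 0.9, 0.1, 0.1], (4, 1))
    a1 = np.tile([0.05, 0.05, 0.4, 0.4], (4, 1))
    maps = np.stack([a0, a1])
    m = masks_from_attention(maps)
    print(m.astype(int))
    print(sampling_ratios(maps, m))
```

In this toy setup the concept with the weaker attention response receives the larger sampling ratio, which is one plausible way to push concepts toward learning at a similar pace.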
