Non-confusing Generation of Customized Concepts in Diffusion Models
Wang Lin, Jingyuan Chen, Jiaxin Shi, Yichen Zhu, Chen Liang, Junzhong Miao, Tao Jin, Zhou Zhao, Fei Wu, Shuicheng Yan, Hanwang Zhang
TL;DR
This work tackles inter-concept visual confusion in text-guided diffusion models for customized concept generation with scarce examples by introducing CLIF, a two-stage method that first contrastively fine-tunes the CLIP text encoder to produce non-confusing concept embeddings $V^*$, and then fine-tunes the image decoder to synthesize multi-concept images without cross-concept leakage. It introduces three augmentation strategies—Global, Region, and Mix augmentations—to create diverse image-text pairs for robust contrastive training and to address identity preservation, attribute binding, and concept attendance. Empirical results across 18 customized concepts demonstrate superior multi-concept composition, outperforming state-of-the-art baselines in both qualitative and quantitative evaluations, and ablations confirm the individual benefits of each augmentation. The approach improves controllability and fidelity in compositional generation, with practical implications for personalized content creation while highlighting necessary ethical safeguards against misuse.
Abstract
We tackle the common challenge of inter-concept visual confusion in compositional concept generation using text-guided diffusion models (TGDMs). This confusion becomes even more pronounced when generating customized concepts, because users provide only a few visual examples of each concept. By revisiting the two major stages leading to the success of TGDMs -- 1) contrastive image-language pre-training (CLIP) of the text encoder, which encodes visual semantics, and 2) training of the TGDM, which decodes the textual embeddings into pixels -- we point out that existing customized generation methods focus only on fine-tuning the second stage while overlooking the first. To this end, we propose a simple yet effective solution called CLIF: contrastive image-language fine-tuning. Specifically, given a few samples of customized concepts, we obtain a non-confusing textual embedding for each concept by fine-tuning CLIP, contrasting the concept with the over-segmented visual regions of other concepts. Experimental results demonstrate the effectiveness of CLIF in preventing confusion when generating multiple customized concepts.
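The contrastive idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an InfoNCE-style objective, uses toy 2-D vectors in place of CLIP embeddings, and the function names (`cosine`, `contrastive_loss`) and the temperature of 0.07 are hypothetical choices made for illustration. The concept's text embedding is pulled toward regions of its own concept and pushed away from regions of other concepts.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(text_emb, own_regions, other_regions, temperature=0.07):
    """InfoNCE-style loss for one concept (illustrative assumption).

    text_emb      -- embedding of the concept token (toy stand-in for V*)
    own_regions   -- embeddings of regions showing this concept (positives)
    other_regions -- embeddings of regions from other concepts (negatives)
    """
    pos = [math.exp(cosine(text_emb, r) / temperature) for r in own_regions]
    neg = [math.exp(cosine(text_emb, r) / temperature) for r in other_regions]
    denom = sum(pos) + sum(neg)
    # Average negative log-probability of picking a positive region.
    return -sum(math.log(p / denom) for p in pos) / len(pos)


# Toy data: one region of the concept itself, one region of another concept.
own = [[1.0, 0.0]]
other = [[0.0, 1.0]]

# An embedding aligned only with its own concept.
separated = contrastive_loss([1.0, 0.0], own, other)
# An embedding equally similar to both concepts, i.e. a "confused" one.
confused = contrastive_loss([1.0, 1.0], own, other)
```

In this toy setting the confused embedding incurs a higher loss than the separated one, so minimizing such a loss would move the embedding away from the other concept's regions. How regions are segmented and augmented, and the exact loss used, are specified in the paper rather than here.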
