Non-confusing Generation of Customized Concepts in Diffusion Models

Wang Lin, Jingyuan Chen, Jiaxin Shi, Yichen Zhu, Chen Liang, Junzhong Miao, Tao Jin, Zhou Zhao, Fei Wu, Shuicheng Yan, Hanwang Zhang

TL;DR

This work tackles inter-concept visual confusion in text-guided diffusion models for customized concept generation with scarce examples by introducing CLIF, a two-stage method that first contrastively fine-tunes the CLIP text encoder to produce non-confusing concept embeddings $V^*$, and then fine-tunes the image decoder to synthesize multi-concept images without cross-concept leakage. It introduces three augmentation strategies—Global, Region, and Mix augmentations—to create diverse image-text pairs for robust contrastive training and to address identity preservation, attribute binding, and concept attendance. Empirical results across 18 customized concepts demonstrate superior multi-concept composition, outperforming state-of-the-art baselines in both qualitative and quantitative evaluations, and ablations confirm the individual benefits of each augmentation. The approach improves controllability and fidelity in compositional generation, with practical implications for personalized content creation while highlighting necessary ethical safeguards against misuse.

Abstract

We tackle the common challenge of inter-concept visual confusion in compositional concept generation using text-guided diffusion models (TGDMs). This confusion becomes even more pronounced when generating customized concepts, because user-provided visual examples of each concept are scarce. By revisiting the two major stages behind the success of TGDMs -- 1) contrastive image-language pre-training (CLIP) of a text encoder that encodes visual semantics, and 2) training the TGDM that decodes the textual embeddings into pixels -- we point out that existing customized generation methods fine-tune only the second stage and overlook the first. To this end, we propose a simple yet effective solution called CLIF: contrastive image-language fine-tuning. Specifically, given a few samples of customized concepts, we obtain non-confusing textual embeddings of a concept by fine-tuning CLIP, contrasting the concept against the over-segmented visual regions of other concepts. Experimental results demonstrate the effectiveness of CLIF in preventing confusion when generating multiple customized concepts.
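To make the contrastive idea concrete, here is a minimal sketch of a generic CLIP-style symmetric contrastive (InfoNCE) loss in NumPy. It is not the paper's loss, which this page does not state. The function names, the temperature of 0.07, and the toy vectors standing in for text and image-region encoder outputs are all assumptions for illustration. The sketch only shows why near-identical text embeddings for two different concepts are penalized when each is contrasted against the other's visual regions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss (hypothetical helper, not the authors' code).

    Row i of text_emb is the embedding of concept i; row i of image_emb is
    the embedding of a visual region of the same concept. Every other row
    acts as a negative.
    """
    t = l2_normalize(np.asarray(text_emb, dtype=float))
    v = l2_normalize(np.asarray(image_emb, dtype=float))
    logits = t @ v.T / temperature            # (n_concepts, n_concepts)
    idx = np.arange(len(t))
    loss_t2i = -log_softmax(logits, axis=1)[idx, idx].mean()  # text -> image
    loss_i2t = -log_softmax(logits, axis=0)[idx, idx].mean()  # image -> text
    return 0.5 * (loss_t2i + loss_i2t)


# Toy stand-ins for encoder outputs: three concepts in a 3-d embedding space.
region_emb = np.eye(3)

# Well-separated text embeddings: each concept aligns with its own region.
separated_text = np.eye(3)

# "Confused" text embeddings: concepts 0 and 1 share the same direction.
shared = (region_emb[0] + region_emb[1]) / np.sqrt(2.0)
confused_text = np.stack([shared, shared, region_emb[2]])

loss_separated = contrastive_loss(separated_text, region_emb)
loss_confused = contrastive_loss(confused_text, region_emb)
print(f"separated: {loss_separated:.4f}, confused: {loss_confused:.4f}")
```

In this toy setting the confused embeddings incur a clearly larger loss than the separated ones, so minimizing such a loss pushes the text embeddings of different concepts apart.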

Paper Structure (14 sections, 5 equations, 10 figures, 2 tables)


Figures (10)

  • Figure 1: The black line and box denote the prevailing pipeline of customized generation methods. Our contribution is to contrast the textual embeddings of customized concepts in the Text Encoder stage, which is shown in the red line and dashed box.
  • Figure 2: Visualization of confusion in embedding space with "cat" as the anchor point (see Appendix for details). It shows an evident correlation between confusion and embedding distance.
  • Figure 3: Visualization of multi-concept customization for challenging cases. When the concepts to be customized belong to categories with high semantic similarity (all belonging to the "humans" superclass), when there is large regional overlap (e.g., the second and third rows), or when styles are combined (e.g., 2D combined with 3D characters in the fifth and sixth rows), the baseline methods suffer from identity loss (red border), attribute leaking (blue border), or concept missing (green border), all of which CLIF effectively avoids.
  • Figure 4: Pipeline of training data curation. We mix the customized concepts and common concepts at the instance level and the segmentation level to help decouple multi-concept token embeddings, which can eliminate the confusion issues.
  • Figure 5: Our two-stage framework for multi-concept learning. We first fine-tune the text encoder to obtain contrastive concept embeddings, and then fine-tune the text-to-image decoder to synthesize non-confusing images.
  • ...and 5 more figures