Learning to Customize Text-to-Image Diffusion In Diverse Context
Taewook Kim, Wei Chen, Qiang Qiu
TL;DR
This work tackles concept overfitting when personalizing text-to-image (T2I) diffusion models by diversifying context purely in the textual space. It introduces a Masked Language Modeling objective, used with a context-rich prompt set and a pretrained contextualizer, to regularize concept embeddings and preserve contextual semantics during customization. The method requires no architectural changes and no large additional dataset, and it consistently improves semantic alignment in the textual space and, in turn, the prompt fidelity of generated images across four baseline T2I customization methods. Theoretical and empirical analyses indicate that semantic enhancement in the textual space translates to improved generation in the image space, making the approach broadly applicable and cost-effective.
Abstract
Most text-to-image customization techniques fine-tune models on a small set of personal concept images captured in minimal contexts. As a result, the model often overfits to these training images and fails to generalize to the new contexts described in later text prompts. Existing customization methods build on the success of representing personal concepts as textual embeddings. In this work, we therefore diversify the context of these personal concepts solely within the textual space, by creating a contextually rich set of text prompts and training with a widely used self-supervised learning objective. Surprisingly, this straightforward and cost-effective method significantly improves semantic alignment in the textual space, and the effect extends to the image space, resulting in higher prompt fidelity for generated images. Our approach requires no architectural modifications, which makes it highly compatible with existing text-to-image customization methods. We demonstrate its broad applicability by combining it with four different baseline methods, achieving notable CLIP score improvements.
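To make the idea of a contextually rich prompt set paired with a masked-language-modeling objective more concrete, the following is a minimal sketch of the data-preparation step only. Everything in it is an assumption made for illustration: the placeholder token `<new1>`, the templates, the 15% masking rate, whitespace tokenization, and the choice to never mask the concept token are not taken from the paper, and the contextualizer that would predict the masked words is not shown.

```python
import random

CONCEPT = "<new1>"  # hypothetical placeholder token for the personal concept
MASK = "[MASK]"     # hypothetical mask symbol

# Hypothetical context templates; the paper's actual prompt set is not given here.
TEMPLATES = [
    "a photo of {c} on the beach",
    "a painting of {c} in a snowy forest",
    "{c} sitting next to a red bicycle",
    "a close-up of {c} under city lights at night",
]


def build_prompts(concept=CONCEPT, templates=TEMPLATES):
    """Place the concept token into each context template."""
    return [t.format(c=concept) for t in templates]


def mask_prompt(prompt, rng, mask_prob=0.15, keep=(CONCEPT,)):
    """Mask random context words and return (masked_tokens, labels).

    labels[i] holds the original word at masked positions and None elsewhere.
    The concept token is never masked (an assumption of this sketch), so the
    masked context words would have to be predicted from it.
    """
    tokens = prompt.split()  # whitespace split stands in for a real tokenizer
    candidates = [i for i, tok in enumerate(tokens) if tok not in keep]
    chosen = {i for i in candidates if rng.random() < mask_prob}
    if not chosen and candidates:
        # Guarantee at least one masked position per prompt.
        chosen = {rng.choice(candidates)}
    masked = [MASK if i in chosen else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in chosen else None for i, tok in enumerate(tokens)]
    return masked, labels


if __name__ == "__main__":
    rng = random.Random(0)
    for prompt in build_prompts():
        masked, labels = mask_prompt(prompt, rng)
        print(" ".join(masked), "| targets:", [l for l in labels if l is not None])
```

In a full training setup, the masked prompts would be fed to a contextualizer and a cross-entropy loss on the masked positions would be added to the baseline customization loss; how that loss is weighted and which parameters it updates are details this sketch does not attempt to reproduce.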
