Visual Concept-driven Image Generation with Text-to-Image Diffusion Model
Tanzila Rahman, Shweta Mahajan, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Leonid Sigal
TL;DR
The paper tackles multi-concept personalization in text-to-image diffusion by jointly learning concept tokens and latent masks. It introduces an EM-like optimization that alternates between refining the tokens and refining latent masks derived from cross-attention maps, using DenseCRF to sharpen the masks and a masked diffusion objective to restrict learning to concept regions. The approach supports generating interactions among several user-defined concepts, as demonstrated with quantitative metrics (e.g., mask IoU improvements, CLIP-based similarity, LPIPS diversity) and user studies, and it holds up across cartoon and real-world styles. The framework reduces the need for explicit masks while enabling controlled, compositional image generation in complex scenes, which suggests a scalable path toward personalized multi-concept generation in practical settings.
Abstract
Text-to-image (TTI) diffusion models have demonstrated impressive results in generating high-resolution images of complex and imaginative scenes. Recent approaches have further extended these methods with personalization techniques that allow them to integrate user-illustrated concepts (e.g., the users themselves) using a few sample image illustrations. However, the ability to generate images with multiple interacting concepts, such as human subjects, as well as concepts that may be entangled in one, or across multiple, image illustrations remains elusive. In this work, we propose a concept-driven TTI personalization framework that addresses these core challenges. We build on existing works that learn custom tokens for user-illustrated concepts, allowing those to interact with existing text tokens in the TTI model. Importantly, to disentangle and better learn the concepts in question, we jointly learn (latent) segmentation masks that separate these concepts in user-provided image illustrations. We do so by introducing an Expectation Maximization (EM)-like optimization procedure in which we alternate between learning the custom tokens and estimating (latent) masks encompassing the corresponding concepts in user-supplied images. We obtain these masks from cross-attention within the U-Net-parameterized latent diffusion model, followed by DenseCRF optimization. We show that such joint alternating refinement leads to the learning of better tokens for concepts and, as a by-product, latent masks. We illustrate the benefits of the proposed approach qualitatively and quantitatively with several examples and use cases that can combine three or more entangled concepts.
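As a rough illustration of the two ingredients the alternating procedure relies on, the following toy sketch turns a cross-attention map into a binary mask and then evaluates a reconstruction error only inside that mask. It is not the authors' implementation: the function names `attention_to_mask` and `masked_mse`, the 0.5 threshold, and the use of plain min-max thresholding in place of DenseCRF refinement are all assumptions made for illustration, and small nested lists stand in for latent tensors.

```python
from typing import List

Grid = List[List[float]]


def attention_to_mask(attn: Grid, threshold: float = 0.5) -> Grid:
    """Min-max normalise an attention map and threshold it into a 0/1 mask.

    Assumption: plain thresholding is a stand-in here; the paper refines
    its masks with DenseCRF, which this sketch does not implement.
    """
    flat = [v for row in attn for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant map carries no localisation signal: return an empty mask.
        return [[0.0 for _ in row] for row in attn]
    return [[1.0 if (v - lo) / span >= threshold else 0.0 for v in row]
            for row in attn]


def masked_mse(pred: Grid, target: Grid, mask: Grid) -> float:
    """Mean squared error over positions where the mask is 1."""
    num = 0.0
    den = 0.0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            num += m * (p - t) ** 2
            den += m
    return num / den if den > 0 else 0.0


# Toy example: the concept attends to the bottom row only.
attn = [[0.1, 0.2],
        [0.8, 0.9]]
mask = attention_to_mask(attn)      # bottom row selected

pred = [[1.0, 1.0],
        [2.0, 4.0]]
target = [[0.0, 0.0],
          [2.0, 2.0]]
loss = masked_mse(pred, target, mask)  # errors in the top row are ignored
print(mask, loss)
```

In an alternating scheme of the kind described above, a mask-estimation step like `attention_to_mask` would be re-run after each round of token updates, and the token updates would minimise a masked objective like `masked_mse`, so that pixels outside the concept region do not influence the learned token.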
