
Visual Concept-driven Image Generation with Text-to-Image Diffusion Model

Tanzila Rahman, Shweta Mahajan, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Leonid Sigal

TL;DR

The paper tackles multi-concept personalization in text-to-image diffusion by jointly learning concept tokens and latent masks. It introduces an EM-like optimization that alternates between refining tokens and latent masks derived from cross-attention maps, with DenseCRF used to sharpen masks and a masked diffusion objective to constrain learning to concept regions. The approach supports generating interactions among several user-defined concepts, demonstrated with quantitative metrics (e.g., mask IoU improvements, CLIP-based similarity, LPIPS diversity) and user studies, and shows resilience across cartoon and real-world styles. This framework reduces the need for explicit masks while enabling controlled, compositional image generation in complex scenes, signaling a scalable path for personalized, multi-concept diffusion generation in practical settings.

Abstract

Text-to-image (TTI) diffusion models have demonstrated impressive results in generating high-resolution images of complex and imaginative scenes. Recent approaches have further extended these methods with personalization techniques that allow them to integrate user-illustrated concepts (e.g., the user him/herself) using a few sample image illustrations. However, the ability to generate images with multiple interacting concepts, such as human subjects, as well as concepts that may be entangled in one, or across multiple, image illustrations remains elusive. In this work, we propose a concept-driven TTI personalization framework that addresses these core challenges. We build on existing works that learn custom tokens for user-illustrated concepts, allowing those to interact with existing text tokens in the TTI model. Importantly, however, to disentangle and better learn the concepts in question, we jointly learn (latent) segmentation masks that disentangle these concepts in user-provided image illustrations. We do so by introducing an Expectation Maximization (EM)-like optimization procedure where we alternate between learning the custom tokens and estimating (latent) masks encompassing corresponding concepts in user-supplied images. We obtain these masks based on cross-attention from within the U-Net-parameterized latent diffusion model, followed by DenseCRF optimization. We illustrate that such joint alternating refinement leads to the learning of better tokens for concepts and, as a by-product, latent masks. We illustrate the benefits of the proposed approach qualitatively and quantitatively with several examples and use cases that can combine three or more entangled concepts.

Paper Structure

This paper contains 13 sections, 4 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Concept-driven image generation: Given images depicting multiple concepts (subjects and context/background), the top output shows the male subject ([v1]) generated by different methods in the context/background ([v3]) of the female subject's image. The bottom output shows the two concepts together in a single image. DreamBooth (Ruiz et al., 2023) (left) encodes [v1], [v2], [v3] from multiple input concept images. It fails to generate multi-concept interactions. Break-A-Scene (Avrahami et al., 2023) (middle) disentangles [v1], [v2], [v3] from a single image. This approach requires human-annotated masks. Our approach (right) disentangles [v1], [v2], [v3] from a single image. The latent masks are obtained from EM-like optimization. Using these optimized masks, our method can automatically produce images with those concepts in new contexts, either by themselves or jointly, interacting with one another.
  • Figure 2: Overview of our EM-style Optimization Framework. Unique tokens are first assigned to the concepts. In this example, three tokens are assigned: [v1] to the male subject in the first image, [v2] to the female subject in the second image, and [v3] to the background in the second image. Where appropriate, these tokens could be initialized with CLIP text embeddings semi-representative of the concept (e.g., "person" for both [v1] and [v2]) or simply initialized with random vector embeddings. The latent masks are then initialized by averaging cross-attention maps (see left of Step 1) between the newly defined tokens and the corresponding images across 50 randomly selected diffusion timesteps. The resulting attention maps are subsequently binarized and refined using DenseCRF, resulting in latent binary masks (see right of Step 1). In Step 2, the tokens are re-optimized for a newly sampled timestep with a combination of Mask Diffusion Loss (3) and Cross-Attention Loss (4) [see text for details]. The new tokens are then used to refine cross-attention and the mask. This alternating optimization continues for a fixed number of steps until masks and tokens converge.
  • Figure 3: Quantitative comparison - Mask IoU as a function of training steps. We compare the estimated (latent) mask and the manually annotated mask for the concepts illustrated in Figure 1. Performance, as a function of mask optimization, for different tokens over training steps is illustrated: (a) Mask optimization for [v1] (i.e., the subject), (b) Mask optimization for [v2] (i.e., another subject), and (c) Mask optimization for [v3] (i.e., the background). The yellow line indicates a baseline where the mask is obtained once and not updated jointly with the tokens (hence the fixed value). A clear trend of improvement is observed for all masks. Curves are smoothed to highlight trends.
  • Figure 4: Qualitative results for concept-driven image generation. Here, we compare our results with mask optimization, the baseline method without mask optimization, and the ground-truth approach.
  • Figure 5: Comparison of generated masks. Here we show user-defined masks, masks generated by the baseline method, and those produced by our approach. Note the significant improvement of the mask as a result of joint optimization (first row).
  • ...and 4 more figures
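
The mask initialization described in the Figure 2 caption (averaging cross-attention maps over sampled timesteps, then binarizing) and the Mask IoU metric plotted in Figure 3 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `init_mask` and `mask_iou`, the min-max normalization, and the 0.5 threshold are assumptions made for illustration, and the DenseCRF refinement step is omitted entirely.

```python
import numpy as np


def init_mask(attn_maps, threshold=0.5):
    """Average per-timestep attention maps and binarize the result.

    attn_maps: array of shape (T, H, W), one cross-attention map per
    sampled diffusion timestep for a single concept token.
    threshold: assumed cut-off applied after min-max normalization
    (hypothetical value, not taken from the paper).
    """
    mean_map = attn_maps.mean(axis=0)
    lo, hi = mean_map.min(), mean_map.max()
    # Rescale to [0, 1] so the threshold is comparable across tokens.
    norm = (mean_map - lo) / (hi - lo + 1e-8)
    return norm >= threshold


def mask_iou(pred, target):
    """Intersection over union of two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Both masks are empty; treat them as identical.
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)
```

In an alternating scheme like the one the caption describes, a function such as `init_mask` would be re-run on attention maps from the updated tokens, and `mask_iou` against a manually annotated mask would give one point on a curve like those in Figure 3.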