Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models

Gihyun Kwon, Simon Jenni, Dingzeyu Li, Joon-Young Lee, Jong Chul Ye, Fabian Caba Heilbron

TL;DR

This work tackles the challenge of generating images that fuse multiple personalized concepts in text-to-image diffusion models. It introduces Concept Weaver, an inference-time framework that first creates a semantics-aligned template image and then performs region-aware concept fusion using individually fine-tuned concept models. The approach combines concept bank training, inversion-based guidance, region masks, and a novel multi-concept sampling strategy with feature injection and concept-aware conditioning, preserving the template's structure while aligning each region's appearance with its target concept. Experimental results show higher concept fidelity than alternative approaches, scalability to more than two concepts, and applicability to real-image editing; the method can also potentially be extended efficiently via LoRA fine-tuning.

Abstract

While there has been significant progress in customizing text-to-image generation models, generating images that combine multiple personalized concepts remains challenging. In this work, we introduce Concept Weaver, a method for composing customized text-to-image diffusion models at inference time. Specifically, the method breaks the process into two steps: creating a template image aligned with the semantics of input prompts, and then personalizing the template using a concept fusion strategy. The fusion strategy incorporates the appearance of the target concepts into the template image while retaining its structural details. The results indicate that our method can generate multiple custom concepts with higher identity fidelity compared to alternative approaches. Furthermore, the method is shown to seamlessly handle more than two concepts and closely follow the semantic meaning of the input prompt without blending appearances across different subjects.

Paper Structure (18 sections, 11 equations, 16 figures, 4 tables)

This paper contains 18 sections, 11 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Concept Weaver's Generation Results. Our method, Concept Weaver, can inject the appearance of arbitrary off-the-shelf concepts (from a Bank of Concepts) to generate realistic images.
  • Figure 2: Concept Weaver's Method. First, we fine-tune a text-to-image model for each target concept in the bank (Step 1). Then we source a template image (Step 2). Given the template image, we apply the inversion process with simultaneous feature extraction to save its structural information (Step 3). In Step 4, we extract region masks from the template image with off-the-shelf models (SAM). With the extracted features and masks, we generate the multi-concept image in Step 5.
  • Figure 3: Image Inversion and Multi-Concept Fusion. (a) To extract and save the structural information of template images, we save the intermediate latents of the images during the DDIM forward process. With the fully inverted noise, we extract the feature outputs from the denoising U-Net during the DDIM reverse process. (b) Starting from the noisy inverted latent, we run the multi-concept fusion generation. We denoise the noisy image with the fine-tuned personalized models. After obtaining the cross-attention layer features from each model, we fuse the different features within each masked region. In this step, we inject the pre-calculated self-attention and ResNet features into the networks.
  • Figure 4: Qualitative Evaluation of Multi-Concept Generation. We assess the quality of image generation by our method compared to baseline approaches, using prompts that incorporate every concept from a predefined concept bank (shown on the left). First row: our method successfully preserves the appearance of the target concepts while all baselines fail. Second row: Mix-of-Show preserves the identity but struggles when the prompt includes a close interaction. Third row: all baseline approaches fail to generate the prompted action or to preserve the concepts' attributes; our model instead generates an image that follows the prompt while preserving the appearance of the concepts. Overall, our model generates concept-aware outputs without any concept-mixing problems.
  • Figure 5: Towards More Complex Multi-Concept Generation. We compare our method against Mix-of-Show on prompts involving four challenging concepts. Mix-of-Show exhibits severe missing-concept problems. Our method, in contrast, successfully generates realistic concept-aware images when using a larger number of concepts.
  • ...and 11 more figures
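
The inversion and region-wise fusion steps described in the Figure 3 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: `ddim_invert`, `fuse_by_region`, and `eps_fn` are hypothetical names, the noise predictor is a stand-in for a denoising U-Net, and the fusion here acts on generic feature arrays rather than on actual cross-attention layer outputs. The inversion uses the standard deterministic DDIM update run toward higher noise levels, and the fusion assumes one binary mask per concept.

```python
import numpy as np


def ddim_invert(x0, alphas_cumprod, eps_fn):
    """Deterministic DDIM inversion (illustrative sketch).

    x0             : clean latent, any array shape
    alphas_cumprod : 1-D array of cumulative alphas, decreasing with the index
    eps_fn         : hypothetical noise predictor, called as eps_fn(x, t)
    Returns the list of latents, one per noise level, starting with x0.
    """
    latents = [x0]
    x = x0
    for t in range(len(alphas_cumprod) - 1):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
        eps = eps_fn(x, t)
        # Predicted clean latent implied by the current latent and noise estimate.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        # Move to the next (noisier) level while reusing the same noise estimate.
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
        latents.append(x)
    return latents


def fuse_by_region(background, concept_feats, masks):
    """Blend per-concept feature maps into a background map (illustrative sketch).

    background    : (H, W, C) features used outside every mask
    concept_feats : list of (H, W, C) feature maps, one per concept model
    masks         : list of (H, W) boolean region masks, one per concept
    Later masks overwrite earlier ones where they overlap.
    """
    fused = background.copy()
    for feat, mask in zip(concept_feats, masks):
        m = mask[..., None].astype(background.dtype)  # broadcast over channels
        fused = m * feat + (1.0 - m) * fused
    return fused
```

With a noise predictor that depends on the latent, the inversion is only approximately reversible, which is one reason the caption describes saving intermediate latents and features from the template rather than relying on the final noise alone. In the fusion sketch, each pixel takes the features of the concept whose mask covers it, so appearances from different concepts are kept in separate regions.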