FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition

Ganggui Ding, Canyu Zhao, Wen Wang, Zhen Yang, Zide Liu, Hao Chen, Chunhua Shen

TL;DR

FreeCustom addresses the challenge of rapid, training-free multi-concept image generation by introducing a dual-path denoising framework and a Multi-Reference Self-Attention (MRSA) mechanism that queries reference concepts during generation. A weighted mask strategy and selective MRSA replacement in deeper U-Net blocks enable accurate preservation of each concept's identity while aligning with the target text, all without fine-tuning. Experiments show competitive performance for single-concept customization and clear advantages for multi-concept composition, with superior time efficiency and strong user-study results. The approach further supports context-aware reference interactions and can augment other diffusion-based methods, offering practical benefits for diverse applications and model compatibility.

Abstract

Benefiting from large-scale pre-trained text-to-image (T2I) generative models, impressive progress has been achieved in customized image generation, which aims to generate user-specified concepts. Existing approaches have focused extensively on single-concept customization and still encounter challenges in complex scenarios that involve combining multiple concepts. These approaches often require retraining or fine-tuning on a few images, leading to time-consuming training processes and impeding their swift deployment. Furthermore, the reliance on multiple images to represent a single concept increases the difficulty of customization. To this end, we propose FreeCustom, a novel tuning-free method to generate customized images of multi-concept composition based on reference concepts, using only one image per concept as input. Specifically, we introduce a new multi-reference self-attention (MRSA) mechanism and a weighted mask strategy that enable the generated image to access and focus more on the reference concepts. In addition, MRSA leverages our key finding that input concepts are better preserved when providing images with context interactions. Experiments show that the images produced by our method are consistent with the given concepts and better aligned with the input text. Our method outperforms or performs on par with training-based methods in terms of multi-concept composition and single-concept customization, while being simpler. Code is available at https://github.com/aim-uofa/FreeCustom.

Paper Structure (32 sections, 4 equations, 21 figures, 3 tables)

This paper contains 32 sections, 4 equations, 21 figures, 3 tables.

Figures (21)

  • Figure 1: Results of customized multi-concept composition. Our method excels at rapidly generating high-quality images with multiple concept combinations, without any model parameter tuning. The identity of each concept is remarkably preserved. Furthermore, our method exhibits great versatility and robustness when dealing with different categories of concepts. This versatility allows users to generate customized images that involve diverse combinations of concepts, catering to their specific needs and preferences. Best viewed on screen.
  • Figure 2: Paradigm comparison. Previous customization methods fall into two main categories: (a) training-based methods and (b) tailored models for generalizable customization. Training-based methods often involve fine-tuning an entire model (Type I) or learning a text embedding to represent a specific subject (Type II). Tailored models typically require re-training on large-scale image datasets to establish a versatile foundation. Unlike these two types of methods, our approach can directly generate customized images of multi-concept combinations without any additional training.
  • Figure 3: Results of single-concept customization.
  • Figure 4: Overview of the pipeline. Given a set of reference images $\mathcal{I} = \{I_1, I_2, I_3\}$ and their corresponding prompts $\mathcal{P} = \{P_1, P_2, P_3\}$, we generate a multi-concept customized composition image $I$ aligned to the target prompt $P$. (a) We use a VAE encoder to convert reference images into the latent representation $\mathbf z_0'$ and a segmentation network to extract masks of the concepts. (b) The denoising process involves two paths: 1) the concepts reference path and 2) the concepts composition path. In 1), we employ a diffusion forward process to transform $\mathbf z_0'$ into $\mathbf z_t'$, subsequently passing $\mathbf z_t'$ to the U-Net $\epsilon_\theta$. Notably, the output of $\epsilon_\theta$ is not used. In 2), we initially sample $\mathbf z_T \sim \mathcal{N} (0,\mathbf{I})$ and iteratively denoise the latent until we obtain $\mathbf z_0$. At each time step $t$, we directly transmit the current latent $\mathbf z_t$ to the modified U-Net $\epsilon_\theta^*$ and employ the MRSA to integrate the features from the last two blocks of both the U-Net $\epsilon_\theta$ and the U-Net $\epsilon_\theta^*$. Finally, we utilize a VAE decoder to convert $\mathbf z_0$ into the final image $I$. (c) The MRSA mechanism. i) Feature injection happens in the self-attention module between U-Net layers, ii) we apply MRSA using the MRSA equation with the weighted mask.
  • Figure 5: Weighted mask strategy. $\mathbf w$ is the weight of the mask, where the first weight corresponds to the main edited subject, and the following three weights are for the input concepts.
  • ...and 16 more figures
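
The captions for Figures 4 and 5 describe MRSA as self-attention in which the composition path also attends to features from the reference path, with a weighted mask $\mathbf w$ that emphasizes the concept regions. The following is a minimal NumPy sketch of that idea, not the authors' implementation. The function name `mrsa`, the argument layout, the toy shapes, and the weight values are all hypothetical. The sketch also assumes that the weights scale the attention logits and that reference tokens outside a concept mask are excluded from the softmax; the captions do not specify either detail.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf receive zero probability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mrsa(q, k, v, ref_ks, ref_vs, ref_masks, weights):
    """Sketch of multi-reference self-attention with a weighted mask.

    q, k, v    : (n, d) query/key/value features from the composition path.
    ref_ks/vs  : lists of (n_i, d) key/value features from the reference path.
    ref_masks  : list of (n_i,) binary arrays, 1 where a token belongs to the concept.
    weights    : [w_0, w_1, ..., w_N]; w_0 is for the composition path itself
                 (assumed > 0), the remaining weights are for the N references.
    """
    d = q.shape[-1]
    # Concatenate the composition-path tokens with all reference tokens.
    k_all = np.concatenate([k, *ref_ks], axis=0)
    v_all = np.concatenate([v, *ref_vs], axis=0)
    # Weighted mask: w_0 on every composition token, w_i on concept tokens, 0 elsewhere.
    m_w = np.concatenate(
        [weights[0] * np.ones(k.shape[0])]
        + [w * m for w, m in zip(weights[1:], ref_masks)]
    )
    logits = (q @ k_all.T) / np.sqrt(d)      # (n, n + sum of n_i)
    logits = logits * m_w[None, :]           # assumption: weights scale the logits
    # Assumption: tokens with zero mask weight are removed from the softmax.
    logits = np.where(m_w[None, :] > 0, logits, -np.inf)
    return softmax(logits) @ v_all

# Toy example: 4 tokens per image, 3 reference concepts, feature size 8.
rng = np.random.default_rng(0)
n, d = 4, 8
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
ref_ks = [rng.normal(size=(n, d)) for _ in range(3)]
ref_vs = [rng.normal(size=(n, d)) for _ in range(3)]
ref_masks = [np.array([1, 1, 0, 0]), np.array([0, 1, 1, 0]), np.array([0, 0, 0, 1])]
out = mrsa(q, k, v, ref_ks, ref_vs, ref_masks, weights=[1.0, 3.0, 3.0, 3.0])
print(out.shape)  # (4, 8)
```

With all reference masks set to zero and $w_0 = 1$, the sketch reduces to ordinary self-attention over the composition path, which is a convenient sanity check. Raising the reference weights above $w_0$ sharpens attention toward concept tokens whose logits are positive.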