MC$^2$: Multi-concept Guidance for Customized Multi-concept Generation

Jiaxiu Jiang, Yabo Zhang, Kailai Feng, Xiaohe Wu, Wenbo Li, Renjing Pei, Fan Li, Wangmeng Zuo

TL;DR

MC$^2$ customizes multiple concepts by integrating independently trained single-concept models at inference time through Multi-concept Guidance (MCG). MCG adaptively refines cross-attention to spatially disentangle concepts and merges latent representations without joint training, supporting heterogeneous architectures such as Textual Inversion, LoRA, and DreamBooth. The authors introduce the MC++ benchmark to evaluate compositions of two to four concepts and report better prompt-reference alignment than training-based baselines. The approach also extends to plain compositional text-to-image generation and includes attention-grounding training to improve trigger-token alignment. The authors state that the code will be made publicly available.

Abstract

Customized text-to-image generation, which synthesizes images based on user-specified concepts, has made significant progress in handling individual concepts. However, when extended to multiple concepts, existing methods often struggle with properly integrating different models and avoiding the unintended blending of characteristics from distinct concepts. In this paper, we propose MC$^2$, a novel approach for multi-concept customization that enhances flexibility and fidelity through inference-time optimization. MC$^2$ enables the integration of multiple single-concept models with heterogeneous architectures. By adaptively refining attention weights between visual and textual tokens, our method ensures that image regions accurately correspond to their associated concepts while minimizing interference between concepts. Extensive experiments demonstrate that MC$^2$ outperforms training-based methods in terms of prompt-reference alignment. Furthermore, MC$^2$ can be seamlessly applied to text-to-image generation, providing robust compositional capabilities. To facilitate the evaluation of multi-concept customization, we also introduce a new benchmark, MC++. The code will be publicly available at https://github.com/JIANGJiaXiu/MC-2.

Paper Structure

This paper contains 21 sections, 11 equations, 14 figures, 4 tables, 1 algorithm.

Figures (14)

  • Figure 1: Overview of our method. MC$^2$ composes separately trained customized models to generate compositions of multiple customized concepts. Here, we train a Textual Inversion [gal2022image] model for <lighthouse>, a LoRA [hu2022lora] for <person>, and a DreamBooth [ruiz2023dreambooth] model for <cat>, each with the reference images of its concept. The reference images are from the CustomConcept101 dataset [kumari2023multi].
  • Figure 2: Illustration of our proposed MC$^2$. Multi-concept Guidance (MCG) is performed at each step of the diffusion process. In the first stage, several parallel diffusion models with different customized modules take the same noise map $z_t$ as input. $p_0$, $p_1$, and $p_n$ denote text prompts encoded by the CLIP text encoder. The cross-attention maps of certain tokens are then extracted to compute $\mathcal{L}_{MCG}$, which is used to update $z_t$ to $z_t'$. In the second stage, the diffusion models take $z_t'$ as input and $z_{t-1}$ is calculated via semantic merging. When the customized modules are omitted and $\mathcal{L}_{CompGen}$ is substituted for $\mathcal{L}_{MCG}$, the framework applies to plain compositional generation.
  • Figure 3: Visualization of MCG. MCG adaptively refines the attention weights between visual and textual tokens, directing image regions to focus on their associated words while diminishing the impact of irrelevant ones.
  • Figure 4: Effect of attention grounding. With attention grounding training, the concepts have clearer attention maps.
  • Figure 5: Qualitative comparisons of customized multi-concept generation methods. The leftmost column shows concept reference images. Only one image is shown for each concept, although the models are trained with more images. Our method is more consistent with the reference images than the competing methods, which sometimes omit the specified concepts.
  • ...and 9 more figures
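
The two-stage step described in the Figure 2 caption can be illustrated with a small, self-contained toy. This is only a sketch under loud assumptions, not the authors' implementation: the captions do not give the form of $\mathcal{L}_{MCG}$ or of semantic merging, so `overlap_loss` (overlap between concept attention maps) and the plain average used for merging are hypothetical stand-ins, and the dictionary fields `keys`, `token`, and `denoise`, along with the `lr` and `step` parameters, are invented for illustration. A finite-difference gradient replaces automatic differentiation so that the example needs only the standard library.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_map(z, keys, token):
    """Attention weight of `token` at each latent position (softmax over tokens)."""
    out = []
    for pos in z:
        logits = [sum(a * b for a, b in zip(pos, k)) for k in keys]
        out.append(softmax(logits)[token])
    return out

def overlap_loss(z, models):
    """Hypothetical stand-in for L_MCG: pairwise overlap of concept attention maps."""
    maps = [attention_map(z, m["keys"], m["token"]) for m in models]
    loss = 0.0
    for i in range(len(maps)):
        for j in range(i + 1, len(maps)):
            loss += sum(a * b for a, b in zip(maps[i], maps[j]))
    return loss

def numerical_grad(f, z, eps=1e-5):
    """Central finite differences; z is perturbed in place and then restored."""
    grad = [[0.0] * len(pos) for pos in z]
    for p in range(len(z)):
        for d in range(len(z[p])):
            old = z[p][d]
            z[p][d] = old + eps
            up = f(z)
            z[p][d] = old - eps
            down = f(z)
            z[p][d] = old
            grad[p][d] = (up - down) / (2 * eps)
    return grad

def mcg_step(z_t, models, lr=0.5, step=0.1):
    # Stage 1: every model sees the same z_t; the attention-based loss updates it.
    work = [row[:] for row in z_t]  # copy so the caller's z_t is not modified
    g = numerical_grad(lambda z: overlap_loss(z, models), work)
    z_prime = [[v - lr * gv for v, gv in zip(row, grow)]
               for row, grow in zip(z_t, g)]
    # Stage 2: each model predicts from z_t'; predictions are merged.
    # Assumption: a plain average stands in for "semantic merging".
    preds = [m["denoise"](z_prime) for m in models]
    merged = [[sum(p[i][d] for p in preds) / len(preds)
               for d in range(len(z_prime[i]))]
              for i in range(len(z_prime))]
    z_prev = [[v - step * e for v, e in zip(row, erow)]
              for row, erow in zip(z_prime, merged)]
    return z_prime, z_prev

# Two toy "customized models": token keys plus a placeholder denoiser each.
models = [
    {"keys": [[1.0, 0.0], [0.0, 1.0]], "token": 0,
     "denoise": lambda z: [[0.5 * v for v in row] for row in z]},
    {"keys": [[0.8, 0.2], [0.0, 1.0]], "token": 0,
     "denoise": lambda z: [[0.2 * v for v in row] for row in z]},
]
z_t = [[0.3, -0.1], [0.1, 0.4], [-0.2, 0.2]]
z_prime, z_prev = mcg_step(z_t, models)
print(overlap_loss(z_t, models), overlap_loss(z_prime, models))
```

In this toy the stage-1 update lowers the overlap between the two concept attention maps before the stage-2 merge produces the next latent. A real pipeline would read cross-attention maps from the diffusion models themselves and differentiate through them with an autodiff framework.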