MultiBooth: Towards Generating All Your Concepts in an Image from Text

Chenyang Zhu; Kai Li; Yue Ma; Chunming He; Xiu Li

TL;DR

MultiBooth addresses multi-concept customization (MCC) in diffusion-based text-to-image generation by splitting the process into a single-concept learning phase and a multi-concept integration phase. In the first phase, Adaptive Concept Normalization (ACN) and efficient LoRA-based concept encoding produce a compact, faithful embedding for each concept. In the second phase, a Regional Customization Module (RCM) uses bounding boxes to guide region-specific cross-attention so that multiple concepts can be assembled in one image. Compared with state-of-the-art MCC approaches, the method achieves higher concept fidelity and prompt alignment at lower inference cost across diverse subject categories. This plug-and-play framework lets users combine arbitrary sets of learned concepts in a single image with efficient inference.

Abstract

This paper introduces MultiBooth, a novel and efficient technique for multi-concept customization in image generation from text. Despite the significant advancements in customized generation methods, particularly with the success of diffusion models, existing methods often struggle with multi-concept scenarios due to low concept fidelity and high inference cost. MultiBooth addresses these issues by dividing the multi-concept generation process into two phases: a single-concept learning phase and a multi-concept integration phase. During the single-concept learning phase, we employ a multi-modal image encoder and an efficient concept encoding technique to learn a concise and discriminative representation for each concept. In the multi-concept integration phase, we use bounding boxes to define the generation area for each concept within the cross-attention map. This method enables the creation of individual concepts within their specified regions, thereby facilitating the formation of multi-concept images. This strategy not only improves concept fidelity but also reduces additional inference cost. MultiBooth surpasses various baselines in both qualitative and quantitative evaluations, showcasing its superior performance and computational efficiency. Project Page: https://multibooth.github.io/

Paper Structure (15 sections, 8 equations, 5 figures, 4 tables)

Introduction
Related Work
Method
Preliminaries
Single-Concept Learning
Multi-modal Concept Extraction.
Adaptive Concept Normalization.
Efficient Concept Encoding.
Multi-Concept Integration
Regional Customization Module.
Experiment
Comparative Study
Ablation Study
Discussions
Conclusion

Figures (5)

Figure 1: MultiBooth can learn individual customization concepts through a few examples and then combine these learned concepts to create multi-concept images based on text prompts. The results indicate that our MultiBooth can effectively preserve high image fidelity and text alignment when encountering complex multi-concept generation demands, including (a) stylization, (b) different spatial relationships, and (c) contextualization.
Figure 2: Overall Pipeline of MultiBooth. (a) During the single-concept learning phase, a multi-modal encoder and LoRA parameters are trained to encode each concept. (b) During the multi-concept integration phase, we first convert $S^*$ and $V^*$ into text embeddings, which are then combined with the corresponding LoRA to form single-concept modules. These single-concept modules, along with the bounding boxes, serve as input to the regional customization module.
Figure 3: Regional Customization Module. We initially divide the image feature into several regions via bounding boxes to acquire the query $Q$ for each concept. Subsequently, we combine the single-concept module with $W_k$ and $W_v$ to derive the corresponding key $K$ and value $V$. After that, we perform the attention operation on the obtained $Q$, $K$, and $V$ to get a partial attention output. The above procedure is applied to each concept simultaneously, forming the final attention output.
Figure 4: Qualitative comparisons. Our method outperforms all the compared methods in image fidelity and prompt alignment.
Figure 5: Qualitative ablation results.
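
The procedure in the Figure 3 caption (slice the image feature by bounding box to get $Q$, derive $K$ and $V$ from each single-concept module, attend per region, then merge) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the dictionary keys (`box`, `emb`, `lora_k`, `lora_v`), and the tensor shapes are all hypothetical. Two behaviours are assumptions that the caption does not specify: pixels outside every box attend to the plain prompt embedding, and where boxes overlap the later concept overwrites the earlier one. The LoRA contribution is modelled as a dense additive update to $W_k$ and $W_v$.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(q, k, v):
    # q: (N, D), k: (T, D), v: (T, D) -> (N, D)
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def regional_attention(feat, text_emb, w_q, w_k, w_v, concepts):
    """Region-wise cross-attention (illustrative sketch only).

    feat:     (H, W, C) image feature map
    text_emb: (T, E) prompt embedding used outside all boxes (assumption)
    w_q:      (C, D) query projection; w_k, w_v: (E, D) key/value projections
    concepts: list of dicts with hypothetical keys
              box    = (x0, y0, x1, y1) in feature-map coordinates
              emb    = (T, E) text embedding holding the concept tokens
              lora_k = (E, D) low-rank update added to w_k
              lora_v = (E, D) low-rank update added to w_v
    """
    h, w, _ = feat.shape
    q = feat @ w_q                                   # (H, W, D)
    d = q.shape[-1]

    # Assumption: start from ordinary cross-attention over the whole map.
    out = cross_attention(q.reshape(-1, d), text_emb @ w_k, text_emb @ w_v)
    out = out.reshape(h, w, -1)

    for c in concepts:
        x0, y0, x1, y1 = c["box"]
        # Query restricted to this concept's bounding box.
        q_region = q[y0:y1, x0:x1].reshape(-1, d)
        # Key and value come from the concept's own embedding and LoRA update.
        k = c["emb"] @ (w_k + c["lora_k"])
        v = c["emb"] @ (w_v + c["lora_v"])
        part = cross_attention(q_region, k, v)       # partial attention output
        # Assumption: on overlap, the later concept overwrites the earlier one.
        out[y0:y1, x0:x1] = part.reshape(y1 - y0, x1 - x0, -1)
    return out
```

Calling `regional_attention` with an empty `concepts` list reduces to ordinary cross-attention, which makes it easy to check that only the boxed regions change when concepts are added.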
