Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition

Chun-Hsiao Yeh, Ta-Ying Cheng, He-Yen Hsieh, Chuan-En Lin, Yi Ma, Andrew Markham, Niki Trigoni, H. T. Kung, Yubei Chen

TL;DR

This work addresses the challenge of multi-concept personalization in text-to-image diffusion by proposing Gen4Gen, a data-centric pipeline that composes multiple personalized concepts into realistic scenes, paired with the MyCanvas benchmark. It demonstrates that improving data quality and training-time prompting strategies yields substantial gains in multi-concept generation without changing model architectures, validated through a holistic CP-CLIP and TI-CLIP evaluation. The MyCanvas dataset and the two proposed metrics enable a more complete assessment of composition accuracy, concept fidelity, and generalization to new backgrounds, highlighting the importance of data curation for foundation models. The findings suggest a practical path forward for scalable, high-quality personalized generation and benchmark development in complex, multi-concept scenarios.

Abstract

Recent text-to-image diffusion models can learn and synthesize images containing novel, personalized concepts (e.g., a user's own pets or specific items) from just a few training examples. This paper tackles two interconnected issues in personalizing text-to-image diffusion models. First, current personalization techniques fail to extend reliably to multiple concepts; we hypothesize that this is due to the mismatch between complex scenes and simple text descriptions in the pre-training dataset (e.g., LAION). Second, given an image containing multiple personalized concepts, no holistic metric exists that evaluates not just the degree of resemblance of the personalized concepts, but also whether all concepts are present in the image and whether the image accurately reflects the overall text description. To address these issues, we introduce Gen4Gen, a semi-automated dataset creation pipeline that uses generative models to combine personalized concepts into complex compositions along with text descriptions. Using this pipeline, we create MyCanvas, a dataset that can be used to benchmark the task of multi-concept personalization. In addition, we design a comprehensive metric comprising two scores (CP-CLIP and TI-CLIP) to better quantify the performance of multi-concept, personalized text-to-image diffusion methods. We provide a simple baseline built on top of Custom Diffusion with empirical prompting strategies for future researchers to evaluate on MyCanvas. We show that by improving data quality and prompting strategies, we can significantly increase multi-concept personalized image generation quality without requiring any modifications to the model architecture or training algorithms.

Paper Structure (19 sections, 4 equations, 12 figures, 2 tables)

Figures (12)

  • Figure 1: Given very few source images representing several concepts (each denoted by concept*), we introduce a semi-automated dataset creation pipeline, Gen4Gen, that composes these concepts into realistic scenes with complex compositions, accompanied by detailed text descriptions; the resulting dataset is MyCanvas. Using the composed MyCanvas dataset boosts the performance of previous methods in multi-concept personalization without changing the architecture or training algorithms. MyCanvas addresses the failure of prior works to extend to multi-concept generation (beyond three concepts) or to challenging cases (e.g., dog and cat, teddybear and plushie).
  • Figure 2: Overview of our Gen4Gen Pipeline for Creating the MyCanvas Dataset. Given source images representing multiple concepts, we leverage recent advancements in image foreground extraction, LLMs, MLLMs, and inpainting to compose realistic, personalized images and paired text descriptions. Our Gen4Gen pipeline has three stages. First (1), we apply a category-agnostic salient object detector to segment the foreground of each object in the composition $O'$. Second (2), we query an LLM for plausible bounding-box compositions. This forms the composite foreground image $I_{fg}$ and its corresponding mask $\mathcal{M}(I_{fg})$. In addition, we ask the LLM to provide a set of background prompts describing potential scenes for $O'$. Third (3), we use a diffusion inpainting model to repaint $I_{fg}$ by embedding it within a background image $I_{bg}$ sourced from the internet, producing the final image $I_{O'}$. To increase the variety of text descriptions while maintaining alignment, we then query an MLLM (LLaVA) to provide a detailed caption describing $I_{O'}$ for a subset of all combinations.
  • Figure 3: Examples from our MyCanvas Dataset. Our semi-automatically generated dataset contains multiple personalized objects in complex compositions, with high-resolution, realistic images and accurate text descriptions (short and detailed).
  • Figure 4: MyCanvas Dataset Statistics. a) A pie chart showing that roughly 30% of the images in MyCanvas are paired with text descriptions longer than 20 words. b) A word cloud of the categories appearing in the images, showing the variety of objects used. c) and d) Word clouds of the frequent descriptions used during training and inference, which differ substantially to ensure a fair comparison.
  • Figure 5: Qualitative Results for Multi-Concept Composition. We present four sets of results in ascending order of composition difficulty (more personalized concepts). When used with training methods such as Custom Diffusion, MyCanvas brings drastic improvements in disentangling object identities that are similar in the latent space (e.g., cat and lion, tractor1 and tractor2), preserving the distinctiveness of each object. Adding our prompting strategy further improves alignment with the caption during image generation (i.e., all the objects are properly reflected). More results are presented in the Appendix.
  • ...and 7 more figures
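
As a rough illustration of stage (2) in the Figure 2 caption, the sketch below pastes already-segmented object cutouts into bounding boxes to form a composite foreground image $I_{fg}$ and its mask $\mathcal{M}(I_{fg})$. It is not the authors' implementation: the function name `compose_foreground`, the `(x0, y0, x1, y1)` box format, the nearest-neighbour resize, and the use of grayscale nested lists in place of real images are all assumptions made for the example. In the pipeline described above, the boxes would come from the LLM and the cutouts from the salient object detector.

```python
from typing import List, Sequence, Tuple

Grid = List[List[int]]
Box = Tuple[int, int, int, int]  # hypothetical format: (x0, y0, x1, y1), x1/y1 exclusive


def compose_foreground(
    canvas_h: int,
    canvas_w: int,
    cutouts: Sequence[Grid],
    cutout_masks: Sequence[Grid],
    boxes: Sequence[Box],
) -> Tuple[Grid, Grid]:
    """Paste each cutout into its box; return (I_fg, mask of I_fg).

    Each cutout is stretched to its box with nearest-neighbour sampling.
    Only pixels where the cutout's own mask is non-zero are copied, so the
    returned mask marks exactly the foreground pixels on the canvas.
    Later cutouts overwrite earlier ones where boxes overlap.
    """
    image = [[0] * canvas_w for _ in range(canvas_h)]
    mask = [[0] * canvas_w for _ in range(canvas_h)]

    for cutout, cmask, (x0, y0, x1, y1) in zip(cutouts, cutout_masks, boxes):
        src_h, src_w = len(cutout), len(cutout[0])
        box_h, box_w = y1 - y0, x1 - x0
        # Clip the box to the canvas; empty or inverted boxes are skipped
        # because the ranges below are then empty.
        for y in range(max(y0, 0), min(y1, canvas_h)):
            for x in range(max(x0, 0), min(x1, canvas_w)):
                # Map the canvas pixel back to a source pixel.
                sy = (y - y0) * src_h // box_h
                sx = (x - x0) * src_w // box_w
                if cmask[sy][sx]:
                    image[y][x] = cutout[sy][sx]
                    mask[y][x] = 1
    return image, mask


if __name__ == "__main__":
    # Two toy "objects": a 2x2 cutout with one background pixel, and a 1x1 cutout.
    cutouts = [[[5, 6], [7, 8]], [[9]]]
    cutout_masks = [[[1, 0], [1, 1]], [[1]]]
    boxes = [(1, 1, 3, 3), (4, 0, 6, 2)]  # stand-ins for LLM-proposed boxes
    i_fg, m_fg = compose_foreground(4, 6, cutouts, cutout_masks, boxes)
    for row in i_fg:
        print(row)
```

The composite and its mask are the kind of inputs an inpainting model would need in stage (3): the mask tells it which pixels belong to the personalized objects and which belong to the background to be repainted. Resolving overlaps by paste order is a simplification chosen here for brevity.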