Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models

Sangwon Jang, Jaehyeong Jo, Kimin Lee, Sung Ju Hwang

TL;DR

In human evaluation, MuDI achieves twice the success rate of existing baselines at personalizing multiple subjects without identity mixing, and it is preferred in over 70% of comparisons against the strongest baseline. The paper also introduces a new metric for evaluating multi-subject personalization.

Abstract

Text-to-image diffusion models have shown remarkable success in generating personalized subjects based on a few reference images. However, current methods often fail when generating multiple subjects simultaneously, resulting in mixed identities with combined attributes from different subjects. In this work, we present MuDI, a novel framework that enables multi-subject personalization by effectively decoupling identities from multiple subjects. Our main idea is to utilize segmented subjects generated by a foundation model for segmentation (Segment Anything) for both training and inference, as a form of data augmentation for training and as initialization for the generation process. We also introduce a new metric to better evaluate the performance of our method on multi-subject personalization. Experimental results show that MuDI can produce high-quality personalized images without identity mixing, even for highly similar subjects, as shown in Figure 1. Specifically, in human evaluation, MuDI obtains twice the success rate of existing baselines for personalizing multiple subjects without identity mixing and is preferred in over 70% of comparisons against the strongest baseline.
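
As a rough illustration of the main idea, the sketch below pastes segmented subjects at random positions and scales onto a blank canvas (the augmentation that the Figure 3 caption calls Seg-Mix) and then builds mean-shifted initial noise from such a composite. It is a minimal NumPy sketch, not the authors' implementation: the function names `seg_mix` and `mean_shifted_noise`, the `scale_range` and `gamma` parameters, the nearest-neighbour resize, the paste-over handling of overlaps, and working on pixel arrays rather than diffusion latents are all assumptions made for illustration. The boolean masks stand in for segmentation output such as SAM's.

```python
import numpy as np


def seg_mix(subjects, masks, canvas_hw=(64, 64), scale_range=(0.3, 0.5), rng=None):
    """Paste segmented subjects at random positions and scales (hypothetical helper).

    subjects: list of HxWx3 float arrays; masks: list of HxW boolean arrays.
    Returns the composite canvas and a boolean map of covered pixels.
    Later subjects are simply pasted over earlier ones (an assumption).
    """
    rng = np.random.default_rng() if rng is None else rng
    H, W = canvas_hw
    canvas = np.zeros((H, W, 3), dtype=np.float32)
    covered = np.zeros((H, W), dtype=bool)
    for img, mask in zip(subjects, masks):
        # Controllable scale: target height as a fraction of the canvas height.
        scale = rng.uniform(*scale_range)
        h = min(H, max(1, int(round(scale * H))))
        w = min(W, max(1, int(round(h * img.shape[1] / img.shape[0]))))
        # Nearest-neighbour resize of the subject and its mask.
        rows = np.arange(h) * img.shape[0] // h
        cols = np.arange(w) * img.shape[1] // w
        small = img[rows][:, cols]
        small_mask = mask[rows][:, cols]
        # Random position such that the subject stays inside the canvas.
        top = rng.integers(0, H - h + 1)
        left = rng.integers(0, W - w + 1)
        # Basic slices are views, so these assignments write into the canvas.
        canvas[top:top + h, left:left + w][small_mask] = small[small_mask]
        covered[top:top + h, left:left + w] |= small_mask
    return canvas, covered


def mean_shifted_noise(layout, gamma=0.5, rng=None):
    """Gaussian noise whose mean is shifted by gamma * layout (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(layout.shape).astype(np.float32)
    return noise + gamma * layout


# Toy usage with two constant-colour "subjects" and full masks.
rng = np.random.default_rng(0)
subjects = [np.full((10, 8, 3), 1.0, dtype=np.float32),
            np.full((6, 12, 3), 2.0, dtype=np.float32)]
masks = [np.ones((10, 8), dtype=bool), np.ones((6, 12), dtype=bool)]
composite, covered = seg_mix(subjects, masks, rng=rng)
init_noise = mean_shifted_noise(composite, gamma=0.5, rng=rng)
```

In this sketch the composite serves both roles named in the abstract: as an augmented training image and as the source of the mean shift at generation time. Setting `gamma` to 0 recovers ordinary Gaussian initialization.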

Paper Structure (76 sections, 2 equations, 40 figures, 3 tables, 3 algorithms)

Figures (40)

  • Figure 1: Given a few images of multiple subjects (red boxes), MuDI can personalize a text-to-image model (e.g., SDXL [podell2023sdxl]) to generate multi-subject images without identity mixing. Some reference images (e.g., Cloud Man and Blue Alien) are created by Sora [Sora], introducing novel concepts not previously encountered by SDXL.
  • Figure 2: Comparison of multi-subject personalization methods on Corgi and Chow Chow images (red boxes) with SDXL [podell2023sdxl]. DreamBooth [ruiz2023dreambooth] produces mixed-identity dogs, such as a Corgi with Chow Chow ears. Cut-Mix [han2023svdiff] often generates artifacts such as unnatural vertical lines. Additionally, layout conditioning such as region control [gu2024mix] proves ineffective at preventing identity blending in recent advanced diffusion models such as SDXL. In contrast, MuDI successfully personalizes each dog, avoiding the identity mixing and artifacts observed in prior methods.
  • Figure 3: Overview of MuDI. (a) We automatically obtain segmented subjects using SAM [kirillov2023sam] and OWLv2 [minderer2024owlv2] in the preprocessing stage. (b) We augment the training data by randomly positioning segmented subjects with controllable scales to train the diffusion model $\epsilon_{\theta}$. We refer to this data augmentation method as Seg-Mix. (c) We initialize the generation process with mean-shifted noise created from segmented subjects, which provides a signal for separating identities without missing subjects.
  • Figure 4: (Left) Overview of Detect-and-Compare. We calculate the mean similarities between detected subjects and reference images to evaluate multi-subject fidelity. Specifically, we compare $\bm{S}^{GT}$ and $\bm{S}^{DC}$. We provide pseudo-code in \ref{alg:dnc}. (Right) Correlation between metrics and human evaluation. We report Spearman's rank correlation coefficient and AUROC.
  • Figure 5: Qualitative comparison of Textual Inversion (TI) [gal2022image], DreamBooth (DB) [ruiz2023dreambooth], DB with region control [gu2024mix], Cut-Mix [han2023svdiff], and MuDI. Images in the same column are generated with the same random seed. We provide more examples in \ref{fig:appendix_qual1}.
  • ...and 35 more figures
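
The two agreement statistics named in the Figure 4 caption can be computed as in the sketch below. It uses only the Python standard library to compute Spearman's rank correlation and AUROC between an automatic metric and binary human judgments. The function names and the toy scores are made up for illustration; they are not the authors' evaluation code or results from the paper.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def auroc(scores, labels):
    """Probability that a positive example outscores a negative one (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: metric scores for four generated images and hypothetical
# human labels (1 = judged free of identity mixing, 0 = judged mixed).
metric_scores = [0.9, 0.2, 0.8, 0.1]
human_labels = [1, 1, 0, 0]
rho = spearman(metric_scores, human_labels)
auc = auroc(metric_scores, human_labels)
```

A metric that tracks human judgment well would give values close to 1 for both statistics, while an AUROC near 0.5 indicates no better than chance at separating successful from failed generations.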