MagiCapture: High-Resolution Multi-Concept Portrait Customization

Junha Hyung, Jaeyo Shin, Jaegul Choo

TL;DR

MagiCapture addresses high-fidelity, multi-concept portrait customization from a few reference images. It introduces a two-phase optimization framework, Masked Reconstruction, Composed Prompt Learning, and a novel Attention Refocusing loss. The model is trained on a composed prompt with pseudo-labels so that it integrates subject identity and reference style jointly while preventing information leakage between the two concepts; a post-processing step further improves fidelity. In the reported experiments, MagiCapture scores higher on identity, style, and aesthetic metrics than DreamBooth, Textual Inversion, and Custom Diffusion, and ablation studies support these contributions. The work also demonstrates generalization to non-human objects, discusses limitations such as artifacts and potential biases, and outlines future work on bias mitigation and ethical considerations.

Abstract

Large-scale text-to-image models including Stable Diffusion are capable of generating high-fidelity photorealistic portrait images. There is an active research area dedicated to personalizing these models, aiming to synthesize specific subjects or styles using provided sets of reference images. However, despite the plausible results from these personalization methods, they tend to produce images that often fall short of realism and are not yet on a commercially viable level. This is particularly noticeable in portrait image generation, where any unnatural artifact in human faces is easily discernible due to our inherent human bias. To address this, we introduce MagiCapture, a personalization method for integrating subject and style concepts to generate high-resolution portrait images using just a few subject and style references. For instance, given a handful of random selfies, our fine-tuned model can generate high-quality portrait images in specific styles, such as passport or profile photos. The main challenge with this task is the absence of ground truth for the composed concepts, leading to a reduction in the quality of the final output and an identity shift of the source subject. To address these issues, we present a novel Attention Refocusing loss coupled with auxiliary priors, both of which facilitate robust learning within this weakly supervised learning setting. Our pipeline also includes additional post-processing steps to ensure the creation of highly realistic outputs. MagiCapture outperforms other baselines in both quantitative and qualitative evaluations and can also be generalized to other non-human objects.

Paper Structure

This paper contains 29 sections, 11 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Generated results of the proposed MagiCapture, a multi-concept personalization method for integrating subject and style concepts to generate high-resolution portrait images using just a few subject and style references.
  • Figure 2: The overall pipeline of MagiCapture, where the training process is formulated as multi-task learning of three different tasks: source, reference, and composed prompt learning. In the composed prompt learning, reference style images serve as pseudo-labels, along with auxiliary identity loss between the source and predicted images. Attention Refocusing loss is applied to all three tasks. After training, users can generate high-fidelity images with integrated concepts and can further manipulate them using varying text conditions.
  • Figure 3: Visualization of aggregated attention maps from UNet layers before and after the application of Attention Refocusing (AR) loss illustrates its importance in achieving information disentanglement and preventing information spill.
  • Figure 4: Curated results of MagiCapture.
  • Figure 5: Qualitative comparisons of MagiCapture with other baseline methods.
  • ...and 12 more figures