Layout-and-Retouch: A Dual-stage Framework for Improving Diversity in Personalized Image Generation

Kangyeol Kim, Wooseok Seo, Sehyun Nam, Bodam Kim, Suhyeon Jeong, Wonwoo Cho, Jaegul Choo, Youngjae Yu

TL;DR

This work tackles the challenge of balancing prompt fidelity and identity preservation in personalized text-to-image generation. It introduces Layout-and-Retouch, a two-stage framework that first uses step-blended denoising with vanilla T2I models to generate diverse layouts, then retouches the subject by performing multi-source attention swap with a reference image to preserve identity while adhering to the prompt. Empirical results on ViCo and DreamMatcher benchmarks show improved layout diversity, stronger prompt fidelity, and robust identity preservation compared to plug-in baselines, including under challenging prompts. The method offers a practical, plug-in friendly strategy for diverse, personalized image synthesis and points to future improvements by adopting stronger backbone models to further enhance layout understanding and consistency.

Abstract

Personalized text-to-image (P-T2I) generation aims to create new, text-guided images featuring a personalized subject given a few reference images. However, balancing the trade-off between prompt fidelity and identity preservation remains a critical challenge. To address this issue, we propose a novel P-T2I method called Layout-and-Retouch, consisting of two stages: 1) layout generation and 2) retouch. In the first stage, our step-blended inference utilizes the inherent sample diversity of vanilla T2I models to produce diversified layout images, while also enhancing prompt fidelity. In the second stage, multi-source attention swapping integrates the context image from the first stage with the reference image, leveraging the structure from the context image and extracting visual features from the reference image. This achieves high prompt fidelity while preserving identity characteristics. Through extensive experiments, we demonstrate that our method generates a wide variety of images with diverse layouts while maintaining the unique identity features of the personalized objects, even with challenging text prompts. This versatility highlights the potential of our framework to handle complex conditions, significantly enhancing the diversity and applicability of personalized image synthesis.
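
The second stage can be illustrated with a minimal sketch, under stated assumptions rather than as the authors' implementation. It assumes that "swapping" means the queries come from the context (layout) path while the keys and values come from the reference path, and that a per-token subject mask decides where the swapped features replace the context's own self-attention output. The function names (`attention`, `masked_swap`) and the toy token lists are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on plain lists of token vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values))
             for j in range(len(values[0]))]
        )
    return outputs


def masked_swap(context_q, context_k, context_v,
                reference_k, reference_v, subject_mask):
    """Blend two attention outputs per token (assumed form of the swap).

    Tokens with mask 1 read appearance from the reference keys/values;
    tokens with mask 0 keep the context path's own self-attention output.
    """
    own = attention(context_q, context_k, context_v)
    swapped = attention(context_q, reference_k, reference_v)
    return [
        [m * s + (1.0 - m) * o for s, o in zip(s_row, o_row)]
        for m, s_row, o_row in zip(subject_mask, swapped, own)
    ]
```

In this sketch the layout is carried by the context queries, which fix where each token attends from, while the subject's appearance enters only through the reference values inside the masked region.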

Paper Structure (44 sections, 6 equations, 19 figures, 5 tables, 1 algorithm)

This paper contains 44 sections, 6 equations, 19 figures, 5 tables, 1 algorithm.

Figures (19)

  • Figure 1: Visualizations of center-point distributions of subjects. Using the ViCo [hao2023vico] evaluation prompts, we generate 10 images per prompt with both Stable Diffusion (SD) [rombach2022high] and DreamBooth [ruiz2023dreambooth] trained with (a) the subject of the reference images. For each subject, we locate the bounding box of the object. We then compute the 2D center-point distribution by fitting a Gaussian distribution to each center point, with the mean being the center point itself and the variance fixed, and then normalize all distributions. (b) Vanilla SD places objects across a wider range than (c) fine-tuned SD, indicating that fine-tuned SD has a weaker ability to generate diverse image layouts. $\sigma^2_{avg}$ denotes the averaged variance of the 2D center points; its value shows that the center-point set is more widely dispersed with vanilla SD.
  • Figure 2: Overall pipeline of Layout-and-Retouch: (a) In the layout generation step, we perform (I) step-blended denoising using vanilla and personalized T2I models. Different subject-related words (e.g., red vase and <*>) are fed to each model. (b) The retouch step focuses on calibrating the target subject while maintaining the layout image structure. This is achieved using (II) multi-source attention swap, where intermediate variables from the attention layers of other denoising paths are used to create the target image, and (III) adaptive mask blending, which combines $\mathbf{M^{\text{SAM}}}$ and a cross-attention map to generate an accurate mask for blending feature maps in the self-attention layer.
  • Figure 3: Qualitative comparisons with challenging prompts. Our method produces images with poses and scales significantly different from the reference images compared to other methods. Additionally, it excels at generating images that accurately follow challenging prompts. It is noteworthy that our method does not produce images with identical structures, benefiting from step-blended denoising.
  • Figure 4: Illustrations of diversity analysis. (a) We measure the IS across individual objects, verifying that our method has a strong tendency to generate more diverse images. (b) Samples from our method are distributed more broadly within the same prompt, meaning that our approach can generate images with diverse configurations.
  • Figure 5: Layout failure cases. (a) shows our generation results when the layout stage fails to generate an appropriate context; (b) shows results when the layout stage fails to faithfully generate the shape of the personalized concept. Since the second stage of our pipeline relies on the layout image, our pipeline can fail on prompts beyond the capacity of the pre-trained SD backbone.
  • ...and 14 more figures
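
The center-point analysis described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: bounding boxes are assumed to be `(x_min, y_min, x_max, y_max)` in pixels, centers are normalized by image size, and $\sigma^2_{avg}$ is interpreted here as the mean of the x and y population variances. The function names, grid size, and `sigma` value are hypothetical.

```python
import math
from statistics import fmean


def box_center(box, width, height):
    """Normalized center (cx, cy) of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / (2.0 * width), (y_min + y_max) / (2.0 * height))


def center_dispersion(boxes, width, height):
    """Average of the per-axis population variances of the box centers.

    Assumed reading of sigma^2_avg; the source does not spell out the formula.
    """
    centers = [box_center(b, width, height) for b in boxes]
    xs = [cx for cx, _ in centers]
    ys = [cy for _, cy in centers]
    mean_x, mean_y = fmean(xs), fmean(ys)
    var_x = fmean([(x - mean_x) ** 2 for x in xs])
    var_y = fmean([(y - mean_y) ** 2 for y in ys])
    return (var_x + var_y) / 2.0


def center_heatmap(centers, grid=8, sigma=0.1):
    """Sum one fixed-variance Gaussian per center on a grid, then normalize."""
    heat = []
    for row in range(grid):
        gy = (row + 0.5) / grid
        heat_row = []
        for col in range(grid):
            gx = (col + 0.5) / grid
            heat_row.append(sum(
                math.exp(-((gx - cx) ** 2 + (gy - cy) ** 2) / (2.0 * sigma ** 2))
                for cx, cy in centers
            ))
        heat.append(heat_row)
    total = sum(map(sum, heat))
    return [[value / total for value in heat_row] for heat_row in heat]
```

Under this reading, a model that always places the subject at the same location gives a dispersion of zero, while placements spread across the frame give a larger value.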