MasterWeaver: Taming Editability and Face Identity for Personalized Text-to-Image Generation

Yuxiang Wei, Zhilong Ji, Jinfeng Bai, Hongzhi Zhang, Lei Zhang, Wangmeng Zuo

TL;DR

MasterWeaver tackles the problem of generating personalized images with faithful identity and flexible text-driven editability in a tuning-free setting. It introduces an identity encoder coupled with dual cross-attention to inject identity features, plus an editing-direction loss that aligns MasterWeaver's editing directions with those of the base T2I model, and a face-augmented dataset to disentangle identity from attributes. The approach is augmented with a background-disentanglement loss and a formal learning objective that balances reconstruction, editing controllability, and background stability. Experimental results on one-shot and few-shot setups show superior text controllability and competitive identity fidelity, with strong performance in both qualitative and quantitative evaluations and reasonable inference speed. These contributions enable practical, efficient personalized generation and offer a pathway to broader applications in editable, identity-preserving diffusion-based synthesis, while highlighting ethical considerations for real-world use.

Abstract

Text-to-image (T2I) diffusion models have shown significant success in personalized text-to-image generation, which aims to generate novel images of the human identities indicated by reference images. Although several tuning-free methods have achieved promising identity fidelity, they usually suffer from overfitting: the learned identity tends to become entangled with irrelevant information, resulting in unsatisfactory text controllability, especially on faces. In this work, we present MasterWeaver, a test-time tuning-free method designed to generate personalized images with both faithful identity and flexible editability. Specifically, MasterWeaver adopts an encoder to extract identity features and steers the image generation through additionally introduced cross-attention. To improve editability while maintaining identity fidelity, we propose an editing direction loss for training, which aligns the editing directions of MasterWeaver with those of the original T2I model. Additionally, a face-augmented dataset is constructed to facilitate disentangled identity learning and further improve editability. Extensive experiments demonstrate that MasterWeaver not only generates personalized images with faithful identity but also exhibits superior text controllability. Our code can be found at https://github.com/csyxwei/MasterWeaver.
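
To make the identity-injection idea concrete, the following is a minimal single-query sketch of steering generation through an additional cross-attention branch. It is an illustration under stated assumptions, not the released implementation: the names `attend`, `dual_cross_attention`, and `id_scale` are hypothetical, and combining the text and identity branches by a weighted sum is an assumption that the abstract does not spell out.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Scaled dot-product attention for a single query vector.
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def dual_cross_attention(query, text_kv, id_kv, id_scale=1.0):
    # Assumption: the identity branch is added to the text branch
    # with a scalar weight (id_scale is a hypothetical parameter).
    text_out = attend(query, text_kv[0], text_kv[1])
    id_out = attend(query, id_kv[0], id_kv[1])
    return [t + id_scale * i for t, i in zip(text_out, id_out)]

# One text token and one identity token, each with a key and a value.
text_kv = ([[1.0, 0.0]], [[2.0, 3.0]])
id_kv = ([[0.0, 1.0]], [[10.0, 20.0]])
out = dual_cross_attention([1.0, 0.0], text_kv, id_kv, id_scale=0.5)
```

Setting `id_scale` to 0 in this sketch recovers the text-only branch, which is one way to see that the identity features act as an additive steering signal on top of the base model's text conditioning.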

Paper Structure (33 sections, 11 equations, 20 figures, 10 tables)

This paper contains 33 sections, 11 equations, 20 figures, 10 tables.

Figures (20)

  • Figure 1: With one single reference image, our MasterWeaver can generate photo-realistic personalized images with diverse clothing, accessories, facial attributes and actions in various contexts. In comparison with existing methods, our method exhibits superior editability while maintaining high identity fidelity.
  • Figure 2: (a) Training pipeline of our MasterWeaver. Specifically, to improve editability while maintaining identity fidelity, we propose an editing direction loss $\mathcal{L}_{edit}$ for training. Additionally, we construct a face-augmented dataset to facilitate disentangled identity learning, further improving editability. (b) Framework of our MasterWeaver. It adopts an encoder to extract identity features and employs them together with the text to steer personalized image generation through cross-attention.
  • Figure 3: Illustration of Editing Direction Loss. By inputting paired text prompts that denote an editing operation, e.g., (a photo of a woman, a photo of a smiling woman), we identify the editing direction in the feature space of the diffusion model. We then align the editing direction of MasterWeaver with that of the original T2I model to improve text controllability without affecting the identity.
  • Figure 4: Construction of Face-Augmented Dataset. We employ E4E (richardson2021encoding) and DeltaEdit (lyu2023deltaedit) to edit the attributes of the reference identity image and construct the face-augmented dataset.
  • Figure 5: Visual comparison of different methods. All images are generated using the single reference image shown on the left. Our MasterWeaver can generate high-quality images with flexible editability and faithful identity. Zoom in for a better view.
  • ...and 15 more figures
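
The editing direction loss described in the Figure 3 caption can be sketched as follows. This is a hedged illustration, not the paper's exact formulation: the helper names are hypothetical, features are flattened to plain lists, and the mean squared difference between directions is an assumed distance (the caption only states that the two directions are aligned).

```python
def editing_direction(src_feat, edit_feat):
    # Direction in feature space induced by a prompt pair, e.g.
    # ("a photo of a woman", "a photo of a smiling woman").
    return [e - s for s, e in zip(src_feat, edit_feat)]

def editing_direction_loss(base_src, base_edit, pers_src, pers_edit):
    # base_*: features from the original T2I model.
    # pers_*: features from the personalized (identity-conditioned) model.
    # Assumption: alignment is measured by mean squared difference.
    d_base = editing_direction(base_src, base_edit)
    d_pers = editing_direction(pers_src, pers_edit)
    return sum((a - b) ** 2 for a, b in zip(d_base, d_pers)) / len(d_base)

# Toy features: the personalized model shifts features by an identity
# offset but also distorts the second component of the edit direction.
base_src, base_edit = [0.0, 1.0], [1.0, 1.0]
pers_src, pers_edit = [0.5, 1.5], [1.5, 2.5]
loss = editing_direction_loss(base_src, base_edit, pers_src, pers_edit)
```

In this sketch, a constant identity offset applied to both the source and edited features cancels out in the direction, so the loss penalizes only changes in how the model responds to the edit, not the identity shift itself.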