AttnDreamBooth: Towards Text-Aligned Personalized Text-to-Image Generation

Lianyu Pang; Jian Yin; Baoquan Zhao; Feize Wu; Fu Lee Wang; Qing Li; Xudong Mao

TL;DR

AttnDreamBooth identifies incorrect embedding alignment as the root cause of the opposite failure modes of Textual Inversion (which overfits the concept) and DreamBooth (which overlooks it) in personalized text-to-image generation. It proposes a three-stage framework that first learns the embedding alignment, then refines the cross-attention map, and finally captures the subject identity, keeping the text encoder fixed throughout. It also adds a cross-attention map regularization term that aligns attention with both the new concept and its super-category. Experiments show strong identity preservation and text alignment, including on complex prompts, with a lightweight training protocol (~20 minutes per concept) and favorable user-study results.

Abstract

Recent advances in text-to-image models have enabled high-quality personalized image synthesis of user-provided concepts with flexible textual control. In this work, we analyze the limitations of two primary techniques in text-to-image personalization: Textual Inversion and DreamBooth. When integrating the learned concept into new prompts, Textual Inversion tends to overfit the concept, while DreamBooth often overlooks it. We attribute these issues to the incorrect learning of the embedding alignment for the concept. We introduce AttnDreamBooth, a novel approach that addresses these issues by separately learning the embedding alignment, the attention map, and the subject identity in different training stages. We also introduce a cross-attention map regularization term to enhance the learning of the attention map. Our method demonstrates significant improvements in identity preservation and text alignment compared to the baseline methods.
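
The staged training described above can be pictured as a schedule that decides which parameter groups receive gradients in each stage. The following is a minimal sketch under stated assumptions, not the authors' implementation: the parameter names, the substring markers (`concept_embedding`, `cross_attn`, `unet.`), and the helper functions are all hypothetical, and the sketch assumes that each stage trains only the group named for it, with the text encoder frozen throughout as the summary states.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    goal: str
    # Hypothetical substrings used to pick trainable parameters by name.
    trainable_markers: Tuple[str, ...]


# Assumed mapping from stage to trainable group; real names depend on the codebase.
STAGES = (
    Stage("stage1", "embedding alignment", ("concept_embedding",)),
    Stage("stage2", "attention map", ("cross_attn",)),
    Stage("stage3", "subject identity", ("unet.",)),
)

# Illustrative parameter names only; none of these come from the paper.
EXAMPLE_PARAMS = [
    "concept_embedding",
    "text_encoder.layers.0.weight",
    "unet.down.0.cross_attn.to_k.weight",
    "unet.down.0.self_attn.to_q.weight",
    "unet.mid.resnet.conv1.weight",
]


def is_trainable(stage: Stage, param_name: str) -> bool:
    """Return True if the parameter should receive gradients in this stage."""
    # The text encoder is treated as frozen in every stage.
    if param_name.startswith("text_encoder."):
        return False
    return any(marker in param_name for marker in stage.trainable_markers)


def trainable_params(stage: Stage, param_names: List[str]) -> List[str]:
    """Filter a list of parameter names down to those trained in this stage."""
    return [name for name in param_names if is_trainable(stage, name)]


if __name__ == "__main__":
    for stage in STAGES:
        print(stage.name, stage.goal, trainable_params(stage, EXAMPLE_PARAMS))
```

In a real training loop, the result of `trainable_params` would typically be used to toggle `requires_grad` on the matching tensors before building the optimizer for that stage. Whether the concept embedding keeps updating in the later stages is not stated here, so the sketch leaves it frozen after the first stage.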

Paper Structure

This paper contains 39 sections, 2 equations, 14 figures, 3 tables.

Introduction
Related Work
Text-to-Image Generation.
Text-to-Image Personalization.
Cross-Attention Control.
Multi-Stage Personalization.
Preliminaries
Latent Diffusion Models.
Textual Inversion.
DreamBooth.
Method
Analysis of Existing Methods
Problems and Analysis.
A Naive Solution.
AttnDreamBooth
...and 24 more sections

Figures (14)

Figure 1: Our method enables text-aligned text-to-image personalization with complex prompts.
Figure 2: Analysis of two principal methods. We visualize the cross-attention maps corresponding to the new concept and other tokens in the prompt. Textual Inversion tends to overfit the textual embedding of the learned concept, resulting in incorrect attention map allocations to other tokens (e.g., "drawing" or "box"). In contrast, DreamBooth appears to overlook the learned concept, producing images primarily based on other tokens.
Figure 3: Overview of AttnDreamBooth. Our method consists of three training stages. In Stage 1, we optimize the textual embedding of the new concept to align its embedding with existing tokens. In Stage 2, we fine-tune the cross-attention layers to refine the attention map. In Stage 3, we fine-tune the entire U-Net to capture the subject identity. Moreover, we introduce a cross-attention map regularization term to guide the learning of the attention map.
Figure 4: Analysis of TI+DB. Column (a) demonstrates that TI+DB neglects the learned concept when integrating it into a new prompt, "A painting of a [V] toy in the style of Monet". Column (b) shows images generated from the single-word prompt "[V]" with the textual embedding before and after fine-tuning, in both cases using the diffusion model without fine-tuning. These images are notably similar to each other, which indicates that the learned textual embedding remains largely unchanged from its initial state.
Figure 5: Results after each training stage. We present the generations along with the attention maps of "[V]" for each stage. In Stage 1, the model properly aligns the embedding of [V] with the other tokens, "inside a box", but learns only a very coarse attention map and subject identity. In Stage 2, the model refines the attention map and subject identity. In Stage 3, the model accurately captures the identity of the concept.
...and 9 more figures
