Beyond Inserting: Learning Identity Embedding for Semantic-Fidelity Personalized Diffusion Generation

Yang Li, Songlin Yang, Wei Wang, Jing Dong

TL;DR

The paper addresses the challenge of semantically faithful personalized diffusion-based image generation for non-famous identities, where prior methods suffer from attention overfit and weak semantic control. It introduces two contributions: Face-Wise Attention Loss to constrain ID-related attention to the face region, and Semantic-Fidelity Token Optimization that represents an ID with five per-stage K-V token pairs, expanding the textual conditioning space. The approach yields higher ID accuracy, improved prompt–image alignment, and robust manipulation of scenes, facial attributes, and actions while requiring only a single image and no external facial priors, with efficient one-shot fine-tuning. The method generalizes beyond faces to other concepts and remains compatible with newer diffusion models such as SDXL, offering practical, scalable personalization for semantic-rich T2I generation.

Abstract

Advanced diffusion-based Text-to-Image (T2I) models, such as the Stable Diffusion Model, have made significant progress in generating diverse and high-quality images using text prompts alone. However, when non-famous users require personalized image generation for their identities (IDs), the T2I models fail to accurately generate their ID-related images. The main problem is that pre-trained T2I models do not learn the mapping between the new ID prompts and their corresponding visual content. The previous methods either failed to accurately fit the face region or lost the interactive generative ability with other existing concepts in T2I models. In other words, they are unable to generate T2I-aligned and semantic-fidelity images for the given prompts with other concepts such as scenes ("Eiffel Tower"), actions ("holding a basketball"), and facial attributes ("eyes closed"). In this paper, we focus on inserting accurate and interactive ID embedding into the Stable Diffusion Model for semantic-fidelity personalized generation. We address this challenge from two perspectives: face-wise region fitting and semantic-fidelity token optimization. Specifically, we first visualize the attention overfit problem and propose a face-wise attention loss to fit the face region instead of entangling ID-unrelated information, such as face layout and background. This key trick significantly enhances the ID accuracy and interactive generative ability with other existing concepts. Then, we optimize one ID representation as multiple per-stage tokens where each token contains two disentangled features. This expansion of the textual conditioning space improves semantic-fidelity control. Extensive experiments validate that our results exhibit superior ID accuracy, text-based manipulation ability, and generalization compared to previous methods.
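
To make the idea of a face-wise attention loss more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the loss can be pictured as a mean squared error between the attention map of the pseudo-word "V*" and a binary face mask; the exact formulation is not given in this summary. The function name `face_wise_attention_loss` and the toy 4x4 maps are hypothetical.

```python
def face_wise_attention_loss(attn_map, face_mask):
    """Mean squared error between an attention map and a binary face mask.

    attn_map  : 2D list of floats in [0, 1], attention of the "V*" token (toy input)
    face_mask : 2D list of 0/1 values, 1 inside the face region (toy input)
    Assumption: an MSE penalty stands in for the paper's actual loss.
    """
    if len(attn_map) != len(face_mask):
        raise ValueError("attention map and mask must have the same height")
    total, count = 0.0, 0
    for attn_row, mask_row in zip(attn_map, face_mask):
        if len(attn_row) != len(mask_row):
            raise ValueError("attention map and mask must have the same width")
        for a, m in zip(attn_row, mask_row):
            total += (a - m) ** 2
            count += 1
    return total / count


# Toy 4x4 face mask: the face occupies the central 2x2 block.
face_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

# "Overfit" attention: the ID token attends strongly to the whole image.
overfit_attn = [[0.9] * 4 for _ in range(4)]

# "Focused" attention: high inside the face region, low elsewhere.
focused_attn = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.9, 0.1],
    [0.1, 0.9, 0.9, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]

loss_overfit = face_wise_attention_loss(overfit_attn, face_mask)
loss_focused = face_wise_attention_loss(focused_attn, face_mask)
```

Under this assumed penalty, attention that spreads over the background is penalized far more than attention concentrated on the face, which is the behavior the abstract attributes to the face-wise attention loss.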

Paper Structure (19 sections, 6 equations, 16 figures, 3 tables, 1 algorithm)

Figures (16)

  • Figure 1: Previous methods for inserting new identities (IDs) into pre-trained Text-to-Image diffusion models for personalized generation have two problems. (1) Attention Overfit: As shown in the activation maps of Textual Inversion (gal2022image) and ProSpect (zhang2023prospect), their "V*" attention takes over nearly the whole image, which means the learned embeddings encode both the human face and ID-unrelated information in the reference images, such as the face region layout and background. This problem severely limits their generative ability and disrupts their interaction with other existing concepts such as "cup", so they fail to generate image content aligned with the given prompt. (2) Limited Semantic-Fidelity: Despite alleviating overfit, Celeb Basis (yuan2023inserting) introduces an excessive face prior, limiting the semantic fidelity of the learned ID embedding (e.g., the "cup" attention still extends into the "V*" face region, and this limitation hinders the control of facial attributes such as "eyes closed"). Therefore, we propose Face-Wise Region Fit (Sec. \ref{attention loss}) and Semantic-Fidelity Token Optimization (Sec. \ref{k-v feature disentangle}) to address problems (1) and (2), respectively. More results: https://com-vis.github.io/SeFi-IDE/.
  • Figure 2: Overview of our framework. We first propose a novel Face-Wise Attention Loss (Sec. \ref{attention loss}) to alleviate the attention overfit problem and make the ID embedding focus on the face region, improving ID accuracy and interactive generative ability. Then, we optimize the target ID embedding as five per-stage token pairs with disentangled features to expand the textual conditioning space with semantic-fidelity control ability (Sec. \ref{k-v feature disentangle}).
  • Figure 3: The details of text condition and K-V feature implementation differences.
  • Figure 4: The different effects of $\bm{P_{i}^{K}}$ and $\bm{P_{i}^{V}}$ tokens. (1) Progressively Adding: We add different ${\{(\bm{P_i^{K}}, \bm{P_i^{V}})\}}_{1\leq i \leq 5}$ tokens to the conditioning information in ten steps. We found that the initial tokens affect the layout of the generated content more (e.g., face region location and poses), while the later tokens affect the ID-related details more. (2) Progressively Substituting: We then substitute different $\bm{P_{i}^{K}}$ and $\bm{P_{i}^{V}}$ tokens of ${\{(\bm{P_i^{K}}, \bm{P_i^{V}})\}}_{1\leq i \leq 5}$. We found that the $\bm{P_{i}^{V}}$ tokens contribute the vast majority of ID-related conditioning information, while the $\bm{P_{i}^{K}}$ tokens contribute more to textural details, such as environment lighting.
  • Figure 5: Details of the Self-Attention module. For simplicity, we disregard the remaining embeddings in $\bm{Y_{t}}$ and focus on the ID embedding $\bm{P}$ associated with the pseudo-word "V*".
  • ...and 11 more figures
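
The five per-stage K-V token pairs mentioned in the captions for Figures 2 and 4 can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' code: it assumes the denoising steps are split into five equal-width stages, that stage i uses the pair (P_i^K, P_i^V), and that the key token only enters the attention scores while the value token only enters the weighted sum. The learned projection matrices of a real attention layer are omitted, and every name and number here (`stage_index`, `attend`, `id_pairs`, the 50 steps, the 2-D toy embeddings) is hypothetical.

```python
import math


def stage_index(step, total_steps, num_stages=5):
    """Map a 0-based denoising step to a stage index (assumes equal-width stages)."""
    if not 0 <= step < total_steps:
        raise ValueError("step out of range")
    return step * num_stages // total_steps


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def attend(query, key_tokens, value_tokens):
    """Single-query scaled dot-product attention with separate key and value tokens."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in key_tokens
    ]
    weights = softmax(scores)
    dim_v = len(value_tokens[0])
    return [
        sum(w * v[j] for w, v in zip(weights, value_tokens))
        for j in range(dim_v)
    ]


# Hypothetical learned pairs (P_i^K, P_i^V), one per stage, as toy 2-D vectors.
id_pairs = [([float(i), 1.0], [1.0, float(i)]) for i in range(5)]

# One ordinary prompt token; in this toy its key and value share one embedding.
context_token = [0.5, 0.5]


def condition_for_step(step, total_steps):
    """Build the key and value token lists used at a given denoising step."""
    p_k, p_v = id_pairs[stage_index(step, total_steps)]
    keys = [context_token, p_k]      # P^K affects where attention goes
    values = [context_token, p_v]    # P^V affects what content is injected
    return keys, values


keys, values = condition_for_step(step=12, total_steps=50)
output = attend([1.0, 0.0], keys, values)
```

Keeping the key and value tokens separate is what lets them be substituted independently, as in the "Progressively Substituting" experiment of Figure 4.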