Dense-Face: Personalized Face Generation Model via Dense Annotation Prediction

Xiao Guo, Manh Tran, Jiaxin Cheng, Xiaoming Liu

TL;DR

Dense-Face presents a diffusion-based personalized face generation framework that preserves identity and aligns with text prompts by freezing a pre-trained Stable Diffusion and introducing a pose-controllable adapter, a pose branch, and a dense-face annotation predictor. It enables two generation modes—text-editing and face-generation—whose outputs are merged through latent-space blending to produce identity-consistent images with accurate pose control, while maintaining text controllability. The approach is supported by the large T2I-Dense-Face dataset, which provides dense annotations (landmarks, depth, pseudo masks) and pose information to learn domain-specific face generation knowledge. Experiments show Dense-Face achieves strong image-text alignment, high fidelity, and robust identity preservation, outperforming several baselines and demonstrating versatile face manipulation capabilities with pose variation. The work offers a scalable path for high-quality, identity-consistent face generation conditioned on text, pose, and reference subjects, along with a valuable dense-annotation dataset for future research.
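
The latent-space blending mentioned above can be pictured as a mask-weighted mix of two latents. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `blend_latents`, the latent shape `(4, 64, 64)`, and the rectangular face mask are all hypothetical, and the summary does not say at which denoising steps blending is applied or how the face region is obtained.

```python
import numpy as np

def blend_latents(z_text, z_face, face_mask):
    """Mask-weighted blend of two latents (illustrative only).

    z_text:    latent from the text-editing mode, shape (C, H, W)
    z_face:    latent from the face-generation mode, shape (C, H, W)
    face_mask: values in [0, 1], shape (1, H, W); 1 marks the face region
    """
    mask = np.clip(face_mask, 0.0, 1.0)
    # Inside the mask take the face-generation latent, outside keep the base.
    return mask * z_face + (1.0 - mask) * z_text

# Hypothetical inputs: random latents and a square "face" region.
rng = np.random.default_rng(0)
z_text = rng.standard_normal((4, 64, 64))
z_face = rng.standard_normal((4, 64, 64))
face_mask = np.zeros((1, 64, 64))
face_mask[:, 16:48, 16:48] = 1.0

z_blend = blend_latents(z_text, z_face, face_mask)
```

In this sketch the background of the base image is untouched and only the masked region comes from the face-generation latent, which matches the described role of the two modes; a soft (feathered) mask would give a smoother seam.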

Abstract

Text-to-image (T2I) personalization diffusion models can generate images of a novel concept from a user-supplied text caption. However, existing T2I personalization methods either require test-time fine-tuning or fail to generate images that align well with the given caption. In this work, we propose a new T2I personalization diffusion model, Dense-Face, which generates face images whose identity is consistent with the given reference subject and which align well with the text caption. Specifically, we introduce a pose-controllable adapter for high-fidelity image generation while maintaining the text-based editing ability of the pre-trained Stable Diffusion (SD). Additionally, we use internal features of the SD UNet to predict dense face annotations, enabling the proposed method to gain domain knowledge in face generation. Empirically, our method achieves state-of-the-art or competitive generation performance in image-text alignment, identity preservation, and pose control.

Paper Structure

This paper contains 15 sections, 8 equations, 12 figures, 5 tables, 1 algorithm.

Figures (12)

  • Figure 1: Additional comparisons among different personalized generation methods. Our proposed Dense-Face generates images whose identity is consistent with the reference image, which can even be an old photo.
  • Figure 2: (a) Our proposed Dense-Face introduces additional components, including a pose branch and PC-adapter, on top of the pre-trained SD. These two components give Dense-Face two generation modes: text-editing mode and face-generation mode. The two modes are used jointly via latent space blending (Sec. \ref{sec:inference}) for personalized generation. For example, given one of the reference subject images, the text-editing mode generates a base image, and the face-generation mode updates the face region to preserve identity in the final result. (b) Dense-Face in face-generation mode generates realistic face images at different pose views.
  • Figure 2: Dense-Face can place subjects in diverse contexts with changed attributes, such as hair color and clothes.
  • Figure 3: We propose Dense-Face for personalized image generation, which introduces additional components, namely a pose-controllable (PC) adapter, a pose branch (i.e., $\bm{\epsilon}_{pose}$), and an annotation prediction module (i.e., $\bm{\epsilon}_{dense}$), on top of the pre-trained T2I-SD. The input includes captions, a head pose, and a reference image (i.e., $\mathbf{I}_{pose}$ and $\mathbf{I}_{id}$). The output includes generated faces ($\mathbf{I}_{tar.}$) and dense face annotations (e.g., face depths ($\mathbf{D}$), pseudo masks ($\mathbf{P}$), and landmarks ($\mathbf{L}$)). During training we update only $\bm{\epsilon}_{pose}$, $\bm{\epsilon}_{dense}$, and the PC-adapter, and freeze the pre-trained SD. The PC-adapter (${\mathbf{w}^{\prime}}^{q}, {\mathbf{w}^{\prime}}^{v}$, and ${\mathbf{w}^{\prime}}^{k}$) modifies the forward propagation of the cross-attention module ($\mathbf{w}^{q}, \mathbf{w}^{v}$, and $\mathbf{w}^{k}$), from $\mathbf{f}_{out}$ (orange dashed line) to $\mathbf{f}^{\prime}_{out}$ (red solid line). $\bm{\epsilon}_{dense}$ uses the internal UNet features (i.e., $\mathbf{f}_{dense}$) to predict dense face annotations.
  • Figure 3: Two samples from the proposed T2I-Dense-Face.
  • ...and 7 more figures
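
The Figure 3 caption says the PC-adapter adds its own query, key, and value projections that change the cross-attention output from $\mathbf{f}_{out}$ to $\mathbf{f}^{\prime}_{out}$ while the pre-trained weights stay frozen. One way to picture this is a second attention path whose output is added to the frozen one. The sketch below rests on that assumption: the additive combination, the `scale` factor, the separate conditioning tokens `cond_ctx`, and all shapes and names are hypothetical, since the captions do not state how the two paths are combined.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(f_in, ctx, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention."""
    q = f_in @ w_q  # (n_tokens, d)
    k = ctx @ w_k   # (n_ctx, d)
    v = ctx @ w_v   # (n_ctx, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def adapted_cross_attention(f_in, text_ctx, cond_ctx, frozen, adapter, scale=1.0):
    """Frozen path plus an assumed additive adapter path."""
    f_out = attention(f_in, text_ctx, *frozen)     # pre-trained weights, not trained
    f_extra = attention(f_in, cond_ctx, *adapter)  # adapter weights, trainable
    return f_out + scale * f_extra                 # plays the role of f'_out

# Hypothetical sizes: 16 image tokens, 5 text tokens, 3 conditioning tokens, width 8.
rng = np.random.default_rng(0)
d = 8
f_in = rng.standard_normal((16, d))
text_ctx = rng.standard_normal((5, d))
cond_ctx = rng.standard_normal((3, d))
frozen = tuple(rng.standard_normal((d, d)) for _ in range(3))   # (w_q, w_k, w_v)
adapter = tuple(rng.standard_normal((d, d)) for _ in range(3))  # (w'_q, w'_k, w'_v)

f_prime = adapted_cross_attention(f_in, text_ctx, cond_ctx, frozen, adapter)
```

With `scale=0.0` the function reduces to the frozen cross-attention, which illustrates why an adapter of this kind can leave the original text-editing behavior available.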