Dense-Face: Personalized Face Generation Model via Dense Annotation Prediction
Xiao Guo, Manh Tran, Jiaxin Cheng, Xiaoming Liu
TL;DR
Dense-Face is a diffusion-based personalized face generation framework that preserves the identity of a reference subject while following the text prompt. It keeps a pre-trained Stable Diffusion model frozen and adds a pose-controllable adapter, a pose branch, and a dense face annotation predictor. The model supports two generation modes, text-editing and face-generation, whose outputs are merged through latent-space blending to produce identity-consistent images with accurate pose control and intact text controllability. The approach is supported by the large T2I-Dense-Face dataset, which provides dense annotations (landmarks, depth, pseudo masks) and pose information for learning domain-specific face generation knowledge. In experiments, Dense-Face achieves strong image-text alignment, high fidelity, and robust identity preservation, outperforming several baselines and supporting face manipulation under pose variation. The work contributes a method for identity-consistent face generation conditioned on text, pose, and a reference subject, together with a densely annotated dataset for future research.
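To make the latent-space blending step more concrete, here is a minimal sketch, not the authors' implementation. It assumes the two modes each produce a latent of shape (C, H, W) and that a soft mask in [0, 1] marks the face region. The function name `blend_latents`, the mask convention, and the linear blending rule are all illustrative assumptions; the summary does not specify how the blend is computed.

```python
import numpy as np


def blend_latents(z_face, z_text, mask):
    """Linearly blend two latents with a soft mask (illustrative assumption).

    z_face: latent from the face-generation mode, shape (C, H, W)
    z_text: latent from the text-editing mode, shape (C, H, W)
    mask:   values in [0, 1], shape (H, W); 1 keeps z_face, 0 keeps z_text
    """
    z_face = np.asarray(z_face, dtype=np.float32)
    z_text = np.asarray(z_text, dtype=np.float32)
    if z_face.shape != z_text.shape:
        raise ValueError("both latents must have the same shape")
    # Clip so that out-of-range mask values cannot extrapolate past either latent.
    mask = np.clip(np.asarray(mask, dtype=np.float32), 0.0, 1.0)
    # The (H, W) mask broadcasts over the leading channel axis.
    return mask * z_face + (1.0 - mask) * z_text


# Toy usage: a 4-channel 8x8 latent with a square "face" region in the middle.
z_face = np.ones((4, 8, 8), dtype=np.float32)
z_text = np.zeros((4, 8, 8), dtype=np.float32)
mask = np.zeros((8, 8), dtype=np.float32)
mask[2:6, 2:6] = 1.0
blended = blend_latents(z_face, z_text, mask)
```

In this toy case the blended latent equals `z_face` inside the masked square and `z_text` elsewhere. A real pipeline would more likely use a soft-edged mask and might blend at several denoising steps, but those details are outside what the summary states.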
Abstract
Text-to-image (T2I) personalization diffusion models can generate images of a novel concept from a user-provided text caption. However, existing T2I personalization methods either require test-time fine-tuning or fail to generate images that align well with the given text caption. In this work, we propose a new T2I personalization diffusion model, Dense-Face, which generates face images that keep the identity of the given reference subject and align well with the text caption. Specifically, we introduce a pose-controllable adapter for high-fidelity image generation that maintains the text-based editing ability of the pre-trained Stable Diffusion (SD). Additionally, we use internal features of the SD UNet to predict dense face annotations, which lets the proposed method gain domain knowledge in face generation. Empirically, our method achieves state-of-the-art or competitive generation performance in image-text alignment, identity preservation, and pose control.
