ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving

Jiehui Huang, Xiao Dong, Wenhui Song, Zheng Chong, Zhenchao Tang, Jun Zhou, Yuhao Cheng, Long Chen, Hanhui Li, Yiqiang Yan, Shengcai Liao, Xiaodan Liang

TL;DR

ConsistentID tackles identity-preserving portrait generation from a single reference image by coupling a multimodal facial prompt generator with an ID-preservation network guided by facial attention localization. It introduces FGID, a large fine-grained facial dataset with region-level descriptions and identity features that enable detailed ID control. The method reports higher identity fidelity and better detail preservation than existing methods on the MyStyle benchmark while keeping inference fast, and its ablations and user studies support the value of fine-grained prompts and region-aware attention. The work advances fine-grained control in diffusion-based personalized portrait synthesis and provides a new dataset and evaluation framework for facial identity research.

Abstract

Diffusion-based technologies have made significant strides, particularly in personalized and customized facial generation. However, existing methods face challenges in achieving high-fidelity and detailed identity (ID) consistency, primarily due to insufficient fine-grained control over facial areas and the lack of a comprehensive strategy for ID preservation that fully considers both intricate facial details and the overall face. To address these limitations, we introduce ConsistentID, an innovative method crafted for diverse identity-preserving portrait generation under fine-grained multimodal facial prompts, utilizing only a single reference image. ConsistentID comprises two key components: a multimodal facial prompt generator that combines facial features, corresponding facial descriptions, and the overall facial context to enhance precision in facial details, and an ID-preservation network optimized through the facial attention localization strategy, aimed at preserving ID consistency in facial regions. Together, these components significantly enhance the accuracy of ID preservation by introducing fine-grained multimodal ID information from facial regions. To facilitate training of ConsistentID, we present a fine-grained portrait dataset, FGID, with over 500,000 facial images, offering greater diversity and comprehensiveness than existing public facial datasets. Experimental results substantiate that ConsistentID achieves exceptional precision and diversity in personalized facial generation, surpassing existing methods on the MyStyle dataset. Furthermore, while ConsistentID introduces more multimodal ID information, it maintains a fast inference speed during generation.
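
The abstract names a facial attention localization strategy but does not state its objective. As a rough illustration only, the sketch below assumes one plausible form: each facial-region token has a cross-attention map and a binary region mask, and the penalty is the mean attention falling outside the mask minus the mean attention inside it, averaged over regions. The function names, the dictionary layout, and the penalty formula are all assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Grid = List[List[float]]  # a 2-D map stored as rows of floats


def region_means(attn: Grid, mask: Grid) -> Tuple[float, float]:
    """Return (mean attention inside the mask, mean attention outside it)."""
    inside: List[float] = []
    outside: List[float] = []
    for attn_row, mask_row in zip(attn, mask):
        for a, m in zip(attn_row, mask_row):
            (inside if m > 0.5 else outside).append(a)
    mean_in = sum(inside) / len(inside) if inside else 0.0
    mean_out = sum(outside) / len(outside) if outside else 0.0
    return mean_in, mean_out


def localization_penalty(attn_maps: Dict[str, Grid], masks: Dict[str, Grid]) -> float:
    """Hypothetical penalty: average (outside - inside) attention over regions.

    Lower is better. The value is negative when every region token attends
    mostly inside its own mask, and zero when attention is spread uniformly.
    """
    total = 0.0
    for region, attn in attn_maps.items():
        mean_in, mean_out = region_means(attn, masks[region])
        total += mean_out - mean_in
    return total / len(attn_maps)


# Toy 2x2 example: "eyes" occupy the top row, "mouth" the bottom row.
masks = {"eyes": [[1, 1], [0, 0]], "mouth": [[0, 0], [1, 1]]}
focused = {"eyes": [[0.5, 0.5], [0.0, 0.0]], "mouth": [[0.0, 0.0], [0.5, 0.5]]}
blended = {"eyes": [[0.25, 0.25], [0.25, 0.25]], "mouth": [[0.25, 0.25], [0.25, 0.25]]}

print(localization_penalty(focused, masks))  # attention stays inside each region
print(localization_penalty(blended, masks))  # attention leaks across regions
```

In this toy setup the focused maps score lower than the blended ones, which is the behavior a localization term of this kind is meant to reward: ID information from one facial region should not bleed into another.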

Paper Structure (16 sections, 2 equations, 17 figures, 6 tables)

This paper contains 16 sections, 2 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: Given some images of input IDs, our ConsistentID can generate diverse personalized ID images based on text prompts using only a single image.
  • Figure 2: The overall framework of our proposed ConsistentID. The framework comprises two key modules: a multimodal facial prompt generator and a purposefully crafted ID-preservation network. The multimodal facial prompt generator consists of two essential components: a fine-grained multimodal feature extractor, which focuses on capturing detailed facial information, and a facial ID feature extractor dedicated to learning facial ID features. The ID-preservation network, in turn, utilizes both facial textual and visual prompts and prevents the blending of ID information from different facial regions through the facial attention localization strategy. This approach ensures the preservation of ID consistency in the facial regions.
  • Figure 3: The framework of our facial encoder for generating fine-grained multimodal facial features.
  • Figure 4: The statistical characteristics of age and gender distribution in the FGID training dataset.
  • Figure 5: User preferences across image fidelity, fine-grained ID fidelity, and overall ID fidelity for different methods.
  • ...and 12 more figures