UniF$^2$ace: A Unified Fine-grained Face Understanding and Generation Model

Junzhe Li, Sifan Zhou, Liya Guo, Xuerui Qiu, Linrui Xu, Delin Qu, Tingting Long, Chun Fan, Ming Li, Hehe Fan, Jun Liu, Shuicheng Yan

TL;DR

A novel theoretical framework with a Dual Discrete Diffusion (D3Diff) loss is introduced, unifying masked generative models with discrete score matching diffusion and leading to a more precise approximation of the negative log-likelihood, which significantly enhances the model's ability to synthesize high-fidelity facial details aligned with text input.

Abstract

Unified multimodal models (UMMs) have emerged as a powerful paradigm in fundamental cross-modality research, demonstrating significant potential in both image understanding and generation. However, existing research in the face domain faces two main challenges: $\textbf{(1)}$ $\textbf{fragmented development}$, with existing methods failing to unify understanding and generation in a single model, which hinders progress toward artificial general intelligence; and $\textbf{(2) a lack of fine-grained facial attributes}$, which are crucial for high-fidelity applications. To address these issues, we propose $\textbf{UniF$^2$ace}$, $\textit{the first UMM specifically tailored for fine-grained face understanding and generation}$. $\textbf{First}$, we introduce a novel theoretical framework with a Dual Discrete Diffusion (D3Diff) loss, which unifies masked generative models with discrete score matching diffusion and yields a more precise approximation of the negative log-likelihood. The D3Diff loss significantly enhances the model's ability to synthesize high-fidelity facial details aligned with the text input. $\textbf{Second}$, we propose a multi-level grouped Mixture-of-Experts architecture that adaptively incorporates semantic and identity facial embeddings to counteract the forgetting of attributes as representations evolve. $\textbf{Finally}$, we construct UniF$^2$aceD-1M, a large-scale dataset comprising 130K fine-grained image-caption pairs and 1M visual question-answering pairs, spanning a much wider range of facial attributes than existing datasets. Extensive experiments demonstrate that UniF$^2$ace outperforms existing models of similar scale in both understanding and generation tasks, with 7.1\% higher Desc-GPT and 6.6\% higher VQA-score, respectively.
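
For readers unfamiliar with masked generative models, the following is a minimal sketch of the generic masked-token objective that such models typically train with: some discrete image tokens are hidden, and the loss is the cross-entropy of the true tokens at the hidden positions only. It is not the D3Diff loss, whose exact form is not given on this page. The names `MASK`, `mask_tokens`, and `masked_cross_entropy` are hypothetical and chosen purely for illustration.

```python
import math
import random

MASK = -1  # hypothetical id standing in for a [MASK] token


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK; return (masked, positions)."""
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK
    return masked, positions


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def masked_cross_entropy(logits_per_position, targets, positions):
    """Mean negative log-probability of the true token, masked positions only."""
    total = 0.0
    for p in positions:
        total -= log_softmax(logits_per_position[p])[targets[p]]
    return total / len(positions)


# Toy usage: 8 tokens from a vocabulary of 10, half of them masked.
vocab_size = 10
tokens = [3, 1, 4, 1, 5, 9, 2, 6]
masked, positions = mask_tokens(tokens, mask_ratio=0.5, rng=random.Random(0))

# A model that knows nothing assigns equal logits to every vocabulary entry.
uniform_logits = [[0.0] * vocab_size for _ in tokens]
loss = masked_cross_entropy(uniform_logits, tokens, positions)
```

With uniform logits the loss equals `math.log(vocab_size)`, the cost of guessing uniformly at random; a trained model would lower it by concentrating probability on the correct tokens.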

Paper Structure

This paper contains 24 sections, 1 theorem, 29 equations, 12 figures, 8 tables.

Key Result

Theorem 1

Let $-\log p_\theta({\mathbf{x}}_0)$ denote the negative log-likelihood of the original data distribution. Then the following inequality holds:

Figures (12)

  • Figure 1: UniF$^2$ace is the first unified multimodal model designed for face understanding and generation, encompassing tasks such as visual question answering (VQA) and text-to-image generation. The generated responses and images demonstrate UniF$^2$ace’s potential in handling fine-grained face attributes.
  • Figure 2: UniF$^2$ace is centered on two key innovations. First, we design a Transformer with a Mixture-of-Experts (MoE) hierarchy: a token-level MoE provides task-specific routing for individual tokens, while a sequence-level MoE injects holistic, domain-specific features. Second, the model's generative capability is optimized by our proposed D3Diff loss, which unifies masked generation with score matching to ensure high-fidelity synthesis of fine-grained facial details.
  • Figure 3: The CLIP/Face Expert enhances the model's understanding of fine-grained facial attributes by incorporating semantic and identity features.
  • Figure 4: UniF$^2$aceD-1M contains high-resolution facial images, the largest number of facial attributes, 130K fine-grained image-caption pairs and 1 million VQAs.
  • Figure 5: Comparative analysis of face image generation quality across SDXL \cite{podell2023sdxl}, TokenFlow \cite{qu2024tokenflow}, OmniFlow \cite{li2024omniflow}, Show-o \cite{xie2024show}, and UniF$^2$ace. Our proposed UniF$^2$ace captures more detailed information from the prompts.
  • ...and 7 more figures
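
The token-level routing mentioned in the Figure 2 caption can be illustrated with a generic top-k gating sketch. This is an assumption-laden toy, not the paper's multi-level grouped MoE: the gate is a plain linear scoring of each token, the experts are arbitrary callables, and the names `route_token`, `gate_weights`, and `top_k` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def route_token(token, gate_weights, experts, top_k=2):
    """Send one token vector through its top_k experts and mix their outputs."""
    # Gate score per expert: dot product of the token with that expert's gate row.
    scores = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(scores)
    # Keep the top_k experts with the highest gate probability.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalise so the kept gate weights sum to one.
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(token)
    for i in chosen:
        weight = probs[i] / norm
        expert_out = experts[i](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, chosen


# Toy usage: three "experts" that simply scale the token by 1, 2, and 3.
experts = [
    lambda t: [1.0 * x for x in t],
    lambda t: [2.0 * x for x in t],
    lambda t: [3.0 * x for x in t],
]
gate_weights = [[2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
out, chosen = route_token([1.0, 0.0], gate_weights, experts, top_k=2)
```

Here the gate prefers experts 0 and 1, so the third expert is never evaluated for this token. Skipping unselected experts is what makes sparse routing cheaper than running every expert on every token.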

Theorems & Definitions (1)

  • Theorem 1