A Generalist FaceX via Learning Unified Facial Representation

Yue Han, Jiangning Zhang, Junwei Zhu, Xiangtai Li, Yanhao Ge, Wei Li, Chengjie Wang, Yong Liu, Xiaoming Liu, Ying Tai

TL;DR

FaceX tackles the challenge of building a single generalist facial editing framework capable of handling many tasks without task-specific training. It introduces a unified Omni-Representation, with FORD for decomposition, FORS for assembling and steering, and FRC for efficient diffusion conditioning, all built atop a pretrained Stable Diffusion model. Experiments across multiple tasks show competitive performance and the ability to mix attributes across regions and tasks, with ablations validating each component. The approach reduces R&D costs for multi-task facial editing and enables flexible mixed editing in a single model. The paper also notes limitations and safety considerations for synthetic faces.

Abstract

This work presents the FaceX framework, a novel facial generalist model capable of handling diverse facial tasks simultaneously. To achieve this goal, we first formulate a unified facial representation for a broad spectrum of facial editing tasks, which macroscopically decomposes a face into fundamental identity, intra-personal variation, and environmental factors. Based on this, we introduce Facial Omni-Representation Decomposing (FORD) for seamless manipulation of various facial components, microscopically decomposing the core aspects of most facial editing tasks. Furthermore, by leveraging the prior of a pretrained StableDiffusion (SD) to enhance generation quality and accelerate training, we design Facial Omni-Representation Steering (FORS) to first assemble unified facial representations and then effectively steer the SD-aware generation process via the efficient Facial Representation Controller (FRC). Our versatile FaceX achieves competitive performance compared to elaborate task-specific models on popular facial editing tasks. Full code and models will be available at https://github.com/diffusion-facex/FaceX.

Paper Structure

This paper contains 13 sections, 3 equations, 12 figures, and 3 tables.

Figures (12)

  • Figure 1: Facial generalist $\mathtt{FaceX}$ is capable of handling diverse facial tasks, ranging from popular face/head swapping and motion-aware face reenactment/animation to semantic-aware attribute editing/inpainting, by one unified model, simultaneously achieving competitive performance that significantly advances the research of general facial models.
  • Figure 2: Left: Proposed facial omni-representation equation that divides one face into a combination of different fine-grained attributes. Right: The attributes of the generated images under different tasks correspond to the decomposition of source and target facial attributes. Here, we analyze four representative facial tasks. For details of other facial tasks, please refer to our supplementary materials.
  • Figure 3: Overview of the $\mathtt{FaceX}$ framework, which consists of: 1) Facial Omni-Representation Decomposing (FORD) $\bm{\varphi}=\{\bm{\varphi}^{ID}, \bm{\varphi}^{Reg}, \bm{\varphi}^{Parse}, \bm{\varphi}^{3DMM}, \bm{\varphi}^{Gaze}\}$, which decomposes a face into component representations, i.e., $\bm{f}^{ID}$, $\bm{f}^{R}$, $\bm{f}^{L}$, $\bm{f}^{T}$, $\bm{f}^{E}$, $\bm{f}^{P}$, and $\bm{f}^{G}$. 2) Facial Omni-Representation Steering (FORS) $\bm{\phi}$, which contains a Task-specific Representation Assembler that assembles various attributes extracted from the source image $\bm{I}^{S}$ and target image $\bm{I}^{T}$ and passes them through a Representation Adapter $\bm{\phi}^{R}$ to yield $\bm{f}^{Rep}$, and a Task-specific Region Assembler that assembles different regions into the inpainting reference image $\bm{I}^{R}$, which is then processed by an image encoder $\bm{\phi}^{Inp}$ to obtain $\bm{f}^{Inp}$. After concatenation with $\bm{f}^{Rep}$, the result is processed by the SD Adapter $\bm{\phi}^{SD}$ to obtain the conditional representation $\bm{f}^{SD}$, which is fed into the conditional denoising U-Net $\boldsymbol{\epsilon}_\theta$. 3) Facial Representation Controller (FRC), which adds one extra cross-attention layer to the existing sequence of fixed self-/cross-attention operations. Under the control of $\bm{f}^{SD}$, it enables the generation of task-specific output images $\bm{I}^{O}$. Notably, because FRC is plug-and-play, representations are integrated through cross-attention layers, allowing the diffusion model to be substituted with other personalized models from the community.
  • Figure 4: Illustration of the task-specific representation and region assemblers, showing omni-representation decomposing of popular facial tasks. The representation here indicates the region feature $\bm{f}^{R}$, encompassing facial texture, hair and background, as inherited from \ref{fig:intro}. However, with more detailed divisions, facial texture is further separated into eyebrows, eyes, nose, lips, ears, and skin.
  • Figure 5: Qualitative comparison results on face reenactment.
  • ...and 7 more figures
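
The Figure 3 caption describes FRC as one extra cross-attention layer controlled by $\bm{f}^{SD}$. The following is a minimal sketch of that idea in plain Python, not the authors' implementation: it assumes single-head attention, a residual connection, and a scalar `gate`, none of which are stated in the source. The names `frc_block`, `cross_attention`, and `random_matrix`, as well as the toy dimensions and random weights, are hypothetical and chosen only for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(x, cond, w_q, w_k, w_v):
    """Single-head cross-attention: queries from x, keys/values from cond."""
    q, k, v = matmul(x, w_q), matmul(cond, w_k), matmul(cond, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v)


def frc_block(x, f_sd, w_q, w_k, w_v, gate=1.0):
    """Assumed form of the extra layer: x plus gated cross-attention on f_sd."""
    attn = cross_attention(x, f_sd, w_q, w_k, w_v)
    return [[xi + gate * ai for xi, ai in zip(xr, ar)] for xr, ar in zip(x, attn)]


def random_matrix(rows, cols, rng):
    """Toy weights/features; a real model would use learned parameters."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d = 4
x = random_matrix(6, d, rng)      # 6 toy U-Net tokens
f_sd = random_matrix(3, d, rng)   # 3 toy conditioning tokens standing in for f^SD
w_q, w_k, w_v = (random_matrix(d, d, rng) for _ in range(3))
out = frc_block(x, f_sd, w_q, w_k, w_v)
```

In this sketch, setting `gate=0.0` returns the input tokens unchanged, which illustrates why a layer added in this residual fashion can be attached to a frozen backbone without altering its original behavior.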