
Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control

Yue Han, Junwei Zhu, Keke He, Xu Chen, Yanhao Ge, Wei Li, Xiangtai Li, Jiangning Zhang, Chengjie Wang, Yong Liu

TL;DR

The paper tackles efficient, high-fidelity face editing with pre-trained diffusion models by decoupling identity, motion, and attributes. It introduces Face-Adapter, which consists of a Spatial Condition Generator, an Identity Encoder, and an Attribute Controller, so that a single model can perform both face reenactment and face swapping. By freezing the U-Net, mapping identity into the diffusion model's text space, and using guided conditional inpainting, it achieves strong performance on VoxCeleb and FF++, with robust background handling. It also discusses practical integration with StableDiffusion and ethical considerations such as potential misuse and watermarking for authenticity.

Abstract

Current face reenactment and swapping methods mainly rely on GAN frameworks, but recent focus has shifted to pre-trained diffusion models for their superior generation capabilities. However, training these models is resource-intensive, and the results have not yet achieved satisfactory performance levels. To address this issue, we introduce Face-Adapter, an efficient and effective adapter designed for high-precision and high-fidelity face editing for pre-trained diffusion models. We observe that both the face reenactment and face swapping tasks essentially involve combinations of target structure, ID, and attributes. We aim to sufficiently decouple the control of these factors to achieve both tasks in one model. Specifically, our method contains: 1) a Spatial Condition Generator that provides precise landmarks and background; 2) a plug-and-play Identity Encoder that transfers face embeddings to the text space by a transformer decoder; 3) an Attribute Controller that integrates spatial conditions and detailed attributes. Face-Adapter achieves comparable or even superior performance in terms of motion control precision, ID retention capability, and generation quality compared to fully fine-tuned face reenactment/swapping models. Additionally, Face-Adapter seamlessly integrates with various StableDiffusion models.
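
The abstract says the Identity Encoder transfers face embeddings to the text space with a transformer decoder, and the Figure 2 caption adds that it uses learnable queries. The core of that idea can be sketched as a single cross-attention step in which a fixed set of query vectors attends over face features and yields one token per query. This is an illustration only, not the authors' implementation: a real transformer decoder also has learned projections, multiple heads, feed-forward layers, and normalization, and its queries are trained rather than randomly drawn. The names `cross_attend`, `learnable_queries`, and `face_features`, and all sizes, are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], memory: List[Vector]) -> List[Vector]:
    """Single-head cross-attention without learned projections.

    Each query is replaced by a softmax-weighted average of the memory
    vectors, using scaled dot-product similarity as the score.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    tokens = []
    for q in queries:
        scores = [scale * sum(qi * mi for qi, mi in zip(q, m)) for m in memory]
        weights = softmax(scores)
        tokens.append(
            [sum(w * m[d] for w, m in zip(weights, memory)) for d in range(dim)]
        )
    return tokens


# Hypothetical sizes: 4 query tokens of width 8 attending over 3 face-feature vectors.
# Random values stand in for trained queries and for a face recognition embedding.
rng = random.Random(0)
NUM_QUERIES, DIM, NUM_FEATURES = 4, 8, 3
learnable_queries = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
face_features = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_FEATURES)]

id_tokens = cross_attend(learnable_queries, face_features)
print(len(id_tokens), len(id_tokens[0]))  # 4 8
```

The number of output tokens is set by the number of queries, not by the size of the face embedding. This is presumably what lets the output take the shape of a text-token sequence that a frozen diffusion model can consume.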

Paper Structure

This paper contains 12 sections, 2 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Top: Face-Adapter supports a 'one-model-two-tasks' approach and demonstrates robustness under various challenging scenarios. Bottom: The design motivation is (1) Both face reenactment and swapping require fully disentangled ID, target structure, and attribute control; (2) Addressing overlooked issues unified in target structure; (3) Effective ID injection avoids SD fine-tuning, making Face-Adapter plug-and-play.
  • Figure 2: Overview pipeline of our proposed Face-Adapter that consists of three modules: 1) The Spatial Condition Generator predicts 3D prior landmarks and adapts the foreground mask automatically, offering more accurate guidance for controlled generation. 2) The Identity Encoder improves identity consistency in generated images by transferring face embeddings to the text space using learnable queries. 3) The Attribute Controller features (i) spatial control that combines target motion landmarks with the invariant background from the Spatial Condition Generator, and (ii) an attribute template to fill in missing attributes.
  • Figure 3: Background inconsistency between the input (i.e., source) and the ground truth (i.e., target) confuses the model, so it fails to learn to generate a clear background. We therefore provide the background of the target image in the spatial condition during training to resolve this inconsistency.
  • Figure 4: Comparison of the mask generated by a pre-trained face parsing model (green) with the mask generated by $\varphi_{Re}$ (white). The green mask does not cover the entire portrait.
  • Figure 5: Same-identity face reenactment results on the VoxCeleb2 test set. Our method faithfully reconstructs the background and facial details.
  • ...and 4 more figures
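
The Figure 2 caption describes a spatial control signal that combines target motion landmarks with the invariant background. The captions do not state how the two are merged, so the sketch below assumes a simple per-pixel mask blend: background pixels are kept outside the foreground mask, and the landmark rendering fills the region inside it. The function name `compose_spatial_condition` and the nested-list image format are hypothetical stand-ins for real image tensors.

```python
from typing import List

Image = List[List[float]]  # H x W grayscale grid, a stand-in for real image tensors


def compose_spatial_condition(background: Image, landmarks: Image, fg_mask: Image) -> Image:
    """Blend a landmark rendering into the background using a foreground mask.

    fg_mask holds values in [0, 1]: 1 marks the region to be regenerated
    (filled with the landmark rendering), 0 keeps the original background.
    This blend is an assumption, not the method stated in the paper.
    """
    height, width = len(background), len(background[0])
    for name, img in (("background", background), ("landmarks", landmarks), ("fg_mask", fg_mask)):
        if len(img) != height or any(len(row) != width for row in img):
            raise ValueError(f"{name} must be a {height}x{width} grid")
    return [
        [
            fg_mask[y][x] * landmarks[y][x] + (1.0 - fg_mask[y][x]) * background[y][x]
            for x in range(width)
        ]
        for y in range(height)
    ]


# Toy 2x3 example: the mask covers the two rightmost pixels of the top row.
background = [[0.2, 0.2, 0.2], [0.2, 0.2, 0.2]]
landmarks = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
fg_mask = [[0.0, 1.0, 1.0], [0.0, 0.0, 0.0]]

condition = compose_spatial_condition(background, landmarks, fg_mask)
```

In the toy example, the masked pixels take their values from the landmark rendering and every other pixel keeps its background value. This mirrors the Figure 3 point that the background passed in the condition is what the model is expected to preserve.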