FaceChain: A Playground for Human-centric Artificial Intelligence Generated Content

Yang Liu; Cheng Yu; Lei Shang; Yongyi He; Ziheng Wu; Xingjun Wang; Chao Xu; Haoyu Xie; Weida Wang; Yuze Zhao; Lin Zhu; Chen Cheng; Weitao Chen; Yuan Yao; Wenmeng Zhou; Jiaqi Xu; Qiang Wang; Yingda Chen; Xuansong Xie; Baigui Sun

TL;DR

FaceChain is a modular, human-centric portrait-generation framework built on Stable Diffusion that fuses style- and face-specific LoRA adapters with a rich set of face-perception models to preserve identity from only a few input images. The system combines careful data processing, targeted label tagging, and a two-stage inpainting pipeline with multi-ControlNet guidance to maintain facial structure and realism, followed by post-processing steps for template-based face fusion and similarity ranking. It enables practical applications such as virtual try-on and talking-head video generation, and offers an extensible library of style-LoRA models for open-ended stylistic variation. The work contributes a configurable, open-source pipeline and demonstrates real-world utility, with future directions including support for multiple subjects, adaptive fusion, and training-free personalization frameworks.

Abstract

Recent advances in personalized image generation have revealed the intriguing capability of pre-trained text-to-image models to learn identity information from a collection of portrait images. However, existing solutions struggle to produce faithful details and usually suffer from several defects: (i) the generated face exhibits its own characteristics, i.e., its facial shape and feature positioning may not resemble those of the input, and (ii) the synthesized face may contain warped, blurred, or corrupted regions. In this paper, we present FaceChain, a personalized portrait generation framework that combines a series of customized image-generation models with a rich set of face-related perceptual understanding models (e.g., face detection, deep face embedding extraction, and facial attribute recognition) to tackle the aforementioned challenges and generate truthful personalized portraits from only a handful of portrait images. Concretely, we inject several state-of-the-art face models into the generation procedure, achieving more efficient label tagging, data processing, and model post-processing than previous solutions such as DreamBooth \cite{ruiz2023dreambooth}, InstantBooth \cite{shi2023instantbooth}, and other LoRA-only approaches \cite{hu2021lora}. In addition, based on FaceChain, we develop several applications, including virtual try-on and 2D talking head, to build a broader playground that better shows its value. We hope it can grow to serve the burgeoning needs of the community. Note that this is ongoing work that will be continually refined and improved. FaceChain is open-sourced under the Apache-2.0 license at https://github.com/modelscope/facechain.
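The summary above mentions deep face embedding extraction and a similarity-ranking step in post-processing. As a minimal sketch of that idea, the snippet below ranks candidate portraits by the cosine similarity between their face embeddings and a reference embedding. The function names, the use of cosine similarity, and the toy three-dimensional vectors are illustrative assumptions, not FaceChain's actual code; a real pipeline would obtain the embeddings from a face-recognition model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_identity(reference, candidates, top_k=1):
    """Return indices of the top_k candidates most similar to the reference.

    `reference` and each entry of `candidates` are hypothetical face
    embeddings (plain lists of floats) assumed to come from some
    face-embedding model; that model is outside the scope of this sketch.
    """
    scored = [
        (cosine_similarity(reference, emb), idx)
        for idx, emb in enumerate(candidates)
    ]
    # Highest similarity first; Python's sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_k]]


# Toy example: candidate 2 matches the reference exactly, candidate 1 is close.
reference_embedding = [1.0, 0.0, 0.0]
candidate_embeddings = [
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [1.0, 0.0, 0.0],
]
best = rank_by_identity(reference_embedding, candidate_embeddings, top_k=2)
```

Returning indices rather than images keeps the ranking independent of how the generated portraits are stored.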

Paper Structure (17 sections, 1 equation, 4 figures, 1 table)

Introduction
Architecture
Data Processing
Face Extraction
Label Tagging
Model Training
Model Inference
Model Post Processing
Inpainting
First Text-to-image Inference
Second Inpainting Inference
Multiple User IDs
Infinite Style
FaceChain Application
Virtual Try-on
...and 2 more sections

Figures (4)

Figure 1: Architectural overview of FaceChain personalized portrait generation. During training, multiple data-processing steps produce tagged face images that are used to train the face-LoRA model online. The weights of the face-LoRA and style-LoRA models are then merged into the foundation Stable Diffusion model for text-to-image generation. The generated portraits go through post-processing (face fusion and ranking) before being returned to users.
Figure 2: Architectural overview of the inpainting pipeline of FaceChain personalized portrait generation. The face image is first generated through text-to-image inference guided by the bone pose of the template image. Face landmarks are then extracted from the warped face to improve facial structure preservation. The final portraits are generated through the second inpainting inference.
Figure 3: Generated results with various style-LoRA models.
Figure 4: Inference pipeline of the talking head application. SadTalker uses 3DMM coefficients as the intermediate motion representation. Given a single image and an audio clip, monocular 3D face reconstruction, PoseVAE, and ExpNet generate realistic 3D motion coefficients (facial expression $\beta$, head pose $\rho$), which then implicitly modulate the 3D-aware face renderer for final video generation.
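The Figure 1 caption describes merging face-LoRA and style-LoRA weights into the foundation model. A minimal sketch of that step, assuming the standard LoRA update in which each adapter contributes a scaled low-rank product to a base weight matrix (W' = W + scale * up * down), is shown below. The function names, the plain-list matrix representation, and the scale values are illustrative assumptions; the merge weights FaceChain actually uses are not stated in this summary.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def merge_lora(base, adapters):
    """Return base + sum(scale * up @ down) over all adapters.

    `base` is an (out x in) weight matrix. Each adapter is a hypothetical
    (up, down, scale) triple with up of shape (out x r) and down of shape
    (r x in), following the usual low-rank LoRA factorization.
    """
    merged = [row[:] for row in base]  # copy so the base weights stay untouched
    for up, down, scale in adapters:
        delta = matmul(up, down)
        for i, delta_row in enumerate(delta):
            for j, value in enumerate(delta_row):
                merged[i][j] += scale * value
    return merged


# Toy 2x2 layer with two rank-1 adapters standing in for face- and style-LoRA.
base_weight = [[1.0, 0.0], [0.0, 1.0]]
face_lora = ([[1.0], [0.0]], [[0.0, 2.0]], 0.5)    # (up, down, scale), assumed values
style_lora = ([[0.0], [1.0]], [[4.0, 0.0]], 0.25)  # (up, down, scale), assumed values
merged_weight = merge_lora(base_weight, [face_lora, style_lora])
```

Because the merge is additive, swapping in a different style adapter only requires recomputing the sum from the untouched base weights.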
