FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio

Chao Xu; Yang Liu; Jiazheng Xing; Weida Wang; Mingze Sun; Jun Dan; Tianxin Huang; Siyuan Li; Zhi-Qi Cheng; Ying Tai; Baigui Sun

TL;DR

This paper abstracts the process of people hearing speech, extracting meaningful cues, and creating various dynamically audio-consistent talking faces, termed Listening and Imagining, into the task of high-fidelity diverse talking face generation from a single audio, and introduces Controllable Coherent Frame generation.

Abstract

In this paper, we abstract the process of people hearing speech, extracting meaningful cues, and creating various dynamically audio-consistent talking faces, termed Listening and Imagining, into the task of high-fidelity diverse talking faces generation from a single audio. Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency. To tackle the issues, we first dig out the intricate relationships among facial factors and simplify the decoupling process, tailoring a Progressive Audio Disentanglement for accurate facial geometry and semantics learning, where each stage incorporates a customized training module responsible for a specific factor. Secondly, to achieve visually diverse and audio-synchronized animation solely from input audio within a single model, we introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models (LDMs) to focus on maintaining facial geometry and semantics, as well as texture and temporal coherence between frames. In this way, we inherit high-quality diverse generation from LDMs while significantly improving their controllability at a low training cost. Extensive experiments demonstrate the flexibility and effectiveness of our method in handling this paradigm. The codes will be released at https://github.com/modelscope/facechain.
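The abstract's idea of attaching trainable adapters to a frozen generative backbone can be illustrated with a deliberately tiny sketch. It is not the authors' implementation: the scalar "backbone", the single additive adapter weight, and all names (`FROZEN_W`, `adapted`, `train_adapter`) are hypothetical stand-ins chosen only to show that training updates the adapter while the pretrained weights stay untouched.

```python
# Toy illustration of "frozen backbone + trainable adapter".
# All names and the scalar model are hypothetical; nothing here is taken
# from the paper or its released code.

FROZEN_W = 2.0  # stands in for pretrained backbone weights that are never updated


def backbone(x, w=FROZEN_W):
    """Frozen 'pretrained' mapping."""
    return w * x


def adapted(x, cond, a):
    """Backbone output plus an additive adapter term driven by a condition signal."""
    return backbone(x) + a * cond


def mse(samples, a):
    """Mean squared error of the adapted model over (x, cond, target) samples."""
    return sum((adapted(x, c, a) - t) ** 2 for x, c, t in samples) / len(samples)


def train_adapter(samples, lr=0.05, steps=200):
    """Fit only the adapter weight `a` by gradient descent on the squared error."""
    a = 0.0  # zero init: the adapted model starts out identical to the backbone
    for _ in range(steps):
        grad = 0.0
        for x, cond, target in samples:
            err = adapted(x, cond, a) - target
            grad += 2.0 * err * cond  # d(err^2)/da
        a -= lr * grad / len(samples)
    return a


# Synthetic targets: the backbone output shifted by 0.5 * cond.
samples = [(x, c, 2.0 * x + 0.5 * c) for x in (0.0, 1.0, 2.0) for c in (-1.0, 1.0)]
a_fit = train_adapter(samples)
```

Initializing the adapter weight at zero means the combined model initially reproduces the frozen backbone exactly, a common design choice when adding control branches to a pretrained model; whether the paper's adapters are initialized this way is not stated in the abstract.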

Paper Structure (14 sections, 7 equations, 9 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Audio-driven Talking Face Generation
Audio to Face Generation
Diffusion Models
Method
Progressive Audio Disentanglement
Controllable Coherent Frame Generation
Experiments
Experimental Setup
Comparison with State-of-the-Art Methods
Further Analysis
Ablation Study and Efficiency Evaluation
Conclusions

Figures (9)

Figure 1: Overview of the proposed method. Our approach is a two-stage framework whose stages correspond to Listening and Imagining. For Listening, the Progressive Audio Disentanglement (PAD) gradually separates identity, content, and emotion from the entangled audio. For Imagining, Controllable Coherent Frame generation receives the facial semantics ($\boldsymbol{\theta}_{id}$ and $\boldsymbol{\theta}_{e}$ inferred from PAD) and geometry (3D mesh $\boldsymbol{I}_{rd}$ rendered from $\hat{\boldsymbol{\alpha}}$, $\hat{\boldsymbol{\beta}}$, and other coefficients extracted from $\boldsymbol{I}$) to synthesize diverse audio-synchronized faces, while $\boldsymbol{I}_{ad}$, $\boldsymbol{I}_{id}$, and $\boldsymbol{I}_{bg}$ are further introduced to achieve highly controllable generation with complete visual and temporal consistency. Please refer to Alg. 1 for more details. In this way, we achieve diverse and high-fidelity face animation solely from audio.
Figure 2: Qualitative results of audio-to-face on MEAD. Icons of the same color indicate samples from the same audio. Ours-I, -C, and -E denote the identity, content, and emotion decoupling stages, respectively.
Figure 3: Visual comparison with recent SOTA methods. Images are generated with the officially released codes for fair comparison. The first sample is selected from HDTF and the second from MEAD. For the third, based on the audio of the first row, our method generates an unseen face, which all competitors then use as the source face to produce talking faces. The first row provides the ground truth for facial expression.
Figure 4: Illustration of disentangled controllability. (a) is under the diverse mode and (b) is under the coherent mode.
Figure 5: Illustration of image retrieval using the identity semantic features. The faces (column 1) are shown only for reference.
...and 4 more figures
