MIMAFace: Face Animation via Motion-Identity Modulated Appearance Feature Learning

Yue Han, Junwei Zhu, Yuxiang Feng, Xiaozhong Ji, Keke He, Xiangtai Li, Zhucun Xue, Yong Liu

TL;DR

This work examines the appearance features essential to face animation and introduces a Motion-Identity Modulated Appearance Learning Module (MIA), which modulates CLIP features at both the motion and identity levels, together with an Inter-clip Affinity Learning Module (ICA), which models temporal relationships across clips.

Abstract

Current diffusion-based face animation methods generally adopt a ReferenceNet (a copy of U-Net) and a large amount of curated self-acquired data to learn appearance features, as robust appearance features are vital for ensuring temporal stability. However, when trained on public datasets, the results often exhibit a noticeable performance gap in image quality and temporal consistency. To address this issue, we closely examine the essential appearance features in face animation tasks, which include motion-agnostic (e.g., clothing, background) and motion-related (e.g., facial details) texture components, along with high-level discriminative identity features. Drawing from this analysis, we introduce a Motion-Identity Modulated Appearance Learning Module (MIA) that modulates CLIP features at both motion and identity levels. Additionally, to tackle the semantic and color discontinuities between clips, we design an Inter-clip Affinity Learning Module (ICA) to model temporal relationships across clips. Our method achieves precise facial motion control (i.e., expressions and gaze) and faithful identity preservation, and generates animation videos that maintain both intra-clip and inter-clip temporal consistency. Moreover, it easily adapts to various modalities of driving sources. Extensive experiments demonstrate the superiority of our method.

Paper Structure (19 sections, 6 equations, 15 figures, 2 tables)

Figures (15)

  • Figure 1: Typical failure cases for current diffusion-based face animation methods: (1)/(2) semantic/color discontinuity across clips, (3) stiff expression, (4) quality degradation.
  • Figure 2: We compare our method to previous diffusion-based face animation methods in terms of appearance feature learning and inter-clip temporal consistency.
  • Figure 3: Pipeline of the proposed MIMAFace, which consists of 1) the Motion-Identity Modulated Appearance Learning Module (MIA) and 2) the Inter-clip Affinity Learning Module (ICA). MIA modulates the appearance features at both motion and identity levels. The source image $\boldsymbol{I}_{S}$ is passed to CLIP $\boldsymbol{E}_{clip}$ to obtain patch tokens $\boldsymbol{e}_{tex}$ and a class token $\boldsymbol{e}_{id}$, which capture the texture and identity, respectively. $\boldsymbol{e}_{tex}$ are then modulated with motion coefficients $\boldsymbol{\rho}, \boldsymbol{\beta}, \boldsymbol{g}$ via cross attention. $\boldsymbol{e}_{id}$ is used to calculate the identity contrastive loss $\mathcal{L}_{id}$. The modulated $\boldsymbol{e}_{tex}$ are concatenated with $\boldsymbol{e}_{id}$ to form the conditioning appearance features. ICA ensures inter-clip temporal consistency by conditioning on image latents $\boldsymbol{s}^{1:F}$ (ground-truth latents during training and denoised ones during inference) and indicating masks $\boldsymbol{m}^{1:F}$ through the added condition module $\boldsymbol{W}_{cond}$. Additionally, we employ 3DMM coefficients $\boldsymbol{\rho}, \boldsymbol{\beta}$ and rendered images $\boldsymbol{I}_{R}$ as intermediate representations for motion. The 3DMM coefficients can adapt to various modalities of driving inputs, i.e., images, audio, and manual modifications.
  • Figure 4: Illustration of our Identity Contrastive Loss. We apply photometric data augmentation on the source image and maintain an ID token memory bank to store ID tokens. By pulling together positive token pairs with variances in pixels or structure and pushing apart the negative token pairs, the loss encourages the appearance encoder to capture high-level discriminative features.
  • Figure 5: Illustration of our Inter-clip Affinity Learning Module. The model learns inter-clip temporal consistency by conditioning the image latent of the preceding frames and using masks to indicate whether reconstruction is required.
  • ...and 10 more figures
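
The identity contrastive loss described in the Figure 4 caption pulls positive ID-token pairs together and pushes them away from tokens stored in a memory bank. The snippet below is only a minimal sketch of that idea using a generic InfoNCE-style formulation; the exact form of $\mathcal{L}_{id}$ is not given in this summary, so the function names (`cosine`, `id_contrastive_loss`), the cosine similarity, and the `temperature` value are all illustrative assumptions rather than the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def id_contrastive_loss(anchor, positive, memory_bank, temperature=0.07):
    """InfoNCE-style loss (an assumed form, not the paper's equation).

    anchor:      ID token of the source image
    positive:    ID token of an augmented view of the same identity
    memory_bank: ID tokens of other identities, used as negatives
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in memory_bank)
    # The loss is small when the positive pair dominates the negatives.
    return -math.log(pos / (pos + neg))

# Toy 2-D tokens, chosen by hand purely for illustration.
anchor = [1.0, 0.0]
bank = [[0.0, 1.0], [-1.0, 0.2]]
loss_aligned = id_contrastive_loss(anchor, [0.9, 0.1], bank)
loss_misaligned = id_contrastive_loss(anchor, [0.0, 1.0], bank)
```

In this toy setup, the loss for the aligned positive is lower than the loss for the misaligned one, which is the behavior the caption attributes to the loss: the encoder is rewarded for mapping views of the same identity to nearby ID tokens.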