MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes

Zhenhui Ye, Tianyun Zhong, Yi Ren, Ziyue Jiang, Jiawei Huang, Rongjie Huang, Jinglin Liu, Jinzheng He, Chen Zhang, Zehan Wang, Xize Chen, Xiang Yin, Zhou Zhao

TL;DR

MimicTalk is the first attempt to exploit the rich knowledge of a NeRF-based person-agnostic generic model to improve the efficiency and robustness of personalized TFG. It surpasses previous baselines in video quality, efficiency, and expressiveness.

Abstract

Talking face generation (TFG) aims to animate a target identity's face to create realistic talking videos. Personalized TFG is a variant that emphasizes the perceptual identity similarity of the synthesized result (from the perspective of appearance and talking style). While previous works typically solve this problem by learning an individual neural radiance field (NeRF) for each identity to implicitly store its static and dynamic information, we find this approach inefficient and hard to generalize because it requires a separate training run per identity and has only limited training data. To this end, we propose MimicTalk, the first attempt that exploits the rich knowledge from a NeRF-based person-agnostic generic model for improving the efficiency and robustness of personalized TFG. To be specific, (1) we first build a person-agnostic 3D TFG model as the base model and propose to adapt it to a specific identity; (2) we propose a static-dynamic-hybrid adaptation pipeline to help the model learn the personalized static appearance and facial dynamic features; (3) to generate the facial motion of the personalized talking style, we propose an in-context stylized audio-to-motion model that mimics the implicit talking style provided in the reference video, avoiding the information loss caused by an explicit style representation. The adaptation process to an unseen identity can be performed in 15 minutes, which is 47 times faster than previous person-dependent methods. Experiments show that our MimicTalk surpasses previous baselines regarding video quality, efficiency, and expressiveness. Source code and video samples are available at https://mimictalk.github.io.

Paper Structure

This paper contains 36 sections, 10 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: The inference process of MimicTalk. We use an in-context stylized audio-to-motion model to produce expressive facial motion mimicking the talking style of a reference video. Then, a personalized renderer could render high-quality talking face videos that mimic the static and dynamic visual attributes of the target identity.
  • Figure 2: The training process of the personalized TFG renderer via the static-dynamic (SD)-hybrid adaptation pipeline. We adopt a pretrained one-shot person-agnostic 3D TFG model as the backbone, then fine-tune a person-dependent 3D face representation to memorize the static geometry and texture details. We also inject LoRA units into the backbone to learn the personalized dynamic features.
  • Figure 3: The process of in-context stylized motion prediction. For the training process, please refer to the figure on audio-guided motion infilling.
  • Figure 4: Training/data efficiency of SD-Hybrid adaptation: CSIM results at different iterations and data scales. The baseline RAD-NeRF uses 180-second-long training samples and is updated for 250,000 iterations.
  • Figure 5: The detailed network structure of the person-agnostic renderer.
  • ...and 2 more figures
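
The Figure 2 caption mentions injecting LoRA units into a frozen backbone so that only a small number of parameters are learned per identity. As a rough illustration of that general idea (not the authors' implementation), the following minimal sketch adds a low-rank update to a single frozen linear layer. The class name `LoRALinear`, the rank, the scaling factor `alpha`, and the initialization are all assumptions made for this example; the summary above does not state which layers MimicTalk adapts or which hyperparameters it uses.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical linear layer: frozen weight W plus a low-rank update B @ A.

    Effective weight: W + (alpha / rank) * B @ A,
    with W of shape (out, in), A of shape (rank, in), B of shape (out, rank).
    """

    def __init__(self, weight, rank=2, alpha=2.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # pretrained weight, kept frozen
        # A starts small and random, B starts at zero, so the layer
        # initially behaves exactly like the pretrained one.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def effective_weight(self):
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the layer to a single input vector of length in_dim."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def num_trainable(self):
        """Count only the LoRA parameters (A and B); W is not trained."""
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)
```

In a setup like this, only `A` and `B` would be updated during per-identity adaptation while `weight` stays fixed. The trainable parameter count is `rank * (in + out)` per layer, which is smaller than the full `in * out` weight whenever the rank is small relative to the layer dimensions.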