TALK-Act: Enhance Textural-Awareness for 2D Speaking Avatar Reenactment with Diffusion Model

Jiazhi Guan, Quanwei Yang, Kaisiyuan Wang, Hang Zhou, Shengyi He, Zhiliang Xu, Haocheng Feng, Errui Ding, Jingdong Wang, Hongtao Xie, Youjian Zhao, Ziwei Liu

TL;DR

TALK-Act (Motion-Enhanced Textural-Aware ModeLing for SpeaKing Avatar Reenactment) is a diffusion-based framework that enables high-fidelity avatar reenactment from only a short monocular video, using carefully constructed 2D and 3D structural information as intermediate guidance.

Abstract

Recently, 2D speaking avatars have increasingly participated in everyday scenarios due to the fast development of facial animation techniques. However, most existing works neglect the explicit control of human bodies. In this paper, we propose to drive not only the faces but also the torso and gesture movements of a speaking figure. Inspired by recent advances in diffusion models, we propose the Motion-Enhanced Textural-Aware ModeLing for SpeaKing Avatar Reenactment (TALK-Act) framework, which enables high-fidelity avatar reenactment from only short footage of monocular video. Our key idea is to enhance the textural awareness with explicit motion guidance in diffusion modeling. Specifically, we carefully construct 2D and 3D structural information as intermediate guidance. While recent diffusion models adopt a side network for control information injection, they fail to synthesize temporally stable results even with person-specific fine-tuning. We propose a Motion-Enhanced Textural Alignment module to enhance the bond between driving and target signals. Moreover, we build a Memory-based Hand-Recovering module to help with the difficulties in hand-shape preserving. After pre-training, our model can achieve high-fidelity 2D avatar reenactment with only 30 seconds of person-specific data. Extensive experiments demonstrate the effectiveness and superiority of our proposed framework. Resources can be found at https://guanjz20.github.io/projects/TALK-Act.

Paper Structure

This paper contains 34 sections, 5 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: TALK-Act Framework. The framework is depicted in five parts: 1) Model Input. The model input is shown on the left side. 2) Feature Encoding. Initial feature encoding is handled by three encoders and the proposed Motion-Enhanced Textural Alignment module. 3) Generation. Two branches, the Reference Branch and Denoising Branch, take input features to generate pose-aligned frames through the denoising process. 4) Model Output. The RGB outputs are then given by a decoder. 5) Block Details. We elaborate on detailed structures of the designed UNet layers. The driven subject is from © Charisma on Command.
  • Figure 2: Qualitative Comparisons. We compare SOTA methods on both self-driven (left) and cross-driven (right) settings. All methods utilize "Reference Frame" for appearance reconstruction, depicted in the first row. Driving video and motion guidance are shown in the first row as well. The driving signals of other methods are omitted here. Please zoom in for a better visualization of animation details. The driven subject on the left is from © Charisma on Command, and the driven subject on the right is from © Vanessa Van Edwards, Science of People.
  • Figure 3: Motion Guidance. We illustrate how to create a specific structural guidance from an RGB frame. The subject is from © Charisma on Command.
  • Figure 4: Visualization of Corresponding Matrix. From two positions of the driving motion, the two queried similarity heatmaps highlight a strong focus on the face and hand regions of the reference, respectively.
  • Figure 5: Ablations. Comparisons of several substitute designs. The driven subject is from © PATS [ahuja2020style] (CC BY-NC 2.0).
  • ...and 3 more figures
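
The similarity heatmaps described for Figure 4 can be illustrated with a small sketch. It assumes, purely for illustration, that the correspondence matrix is a row-wise softmax over cosine similarities between driving-motion features and reference features; the source does not give the exact formulation, and the names `correspondence_matrix`, `query_heatmap`, and `temperature` are hypothetical.

```python
import numpy as np


def correspondence_matrix(motion_feat, ref_feat, temperature=0.1):
    """Soft correspondence between two feature maps (assumed formulation).

    motion_feat: (Hm, Wm, C) features from the driving motion guidance.
    ref_feat:    (Hr, Wr, C) features from the reference frame.
    Returns an (Hm*Wm, Hr*Wr) matrix whose rows each sum to 1.
    """
    c = motion_feat.shape[-1]
    q = motion_feat.reshape(-1, c)
    k = ref_feat.reshape(-1, c)
    # L2-normalise so the dot product below is a cosine similarity.
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    k = k / (np.linalg.norm(k, axis=1, keepdims=True) + 1e-8)
    logits = (q @ k.T) / temperature
    # Subtract the row maximum for a numerically stable softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def query_heatmap(corr, position, motion_hw, ref_hw):
    """Heatmap over the reference for one (row, col) position of the motion map."""
    row, col = position
    _, motion_w = motion_hw
    ref_h, ref_w = ref_hw
    return corr[row * motion_w + col].reshape(ref_h, ref_w)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    motion = rng.normal(size=(4, 4, 16))
    reference = rng.normal(size=(4, 4, 16))
    corr = correspondence_matrix(motion, reference)
    heat = query_heatmap(corr, (1, 2), (4, 4), (4, 4))
    print(heat.shape)
```

Querying a motion position that lies on the face or a hand would, under this assumption, return a heatmap that peaks on the matching region of the reference, which is the behaviour the caption describes.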