GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expression

Ziqi Zhou, Weize Quan, Hailin Shi, Wei Li, Lili Wang, Dong-Ming Yan

TL;DR

GoHD tackles the challenge of audio-driven talking-face generation across diverse identities with limited training data by integrating a gaze-aware latent-space animator, a prosody-sensitive diffusion model for rhythmic head poses, and a two-stage expression predictor that separates lip synchronization from temporally driven eye motions. The approach enables robust generalization to unseen subjects, controllable gaze, and multi-modal driving using intermediate motion descriptors. Quantitative and qualitative evaluations show GoHD achieves competitive lip-sync, improved pose realism, and natural eye motions, with a user study favoring its overall naturalness. The work offers practical impact for avatar creation and teleconferencing by delivering realistic, controllable, and data-efficient portrait animation.

Abstract

Audio-driven talking head generation necessitates seamless integration of audio and visual data amidst the challenges posed by diverse input portraits and intricate correlations between audio and facial motions. In response, we propose a robust framework GoHD designed to produce highly realistic, expressive, and controllable portrait videos from any reference identity with any motion. GoHD innovates with three key modules: Firstly, an animation module utilizing latent navigation is introduced to improve the generalization ability across unseen input styles. This module achieves high disentanglement of motion and identity, and it also incorporates gaze orientation to rectify unnatural eye movements that were previously overlooked. Secondly, a conformer-structured conditional diffusion model is designed to guarantee head poses that are aware of prosody. Thirdly, to estimate lip-synchronized and realistic expressions from the input audio within limited training data, a two-stage training strategy is devised to decouple frequent and frame-wise lip motion distillation from the generation of other more temporally dependent but less audio-related motions, e.g., blinks and frowns. Extensive experiments validate GoHD's advanced generalization capabilities, demonstrating its effectiveness in generating realistic talking face results on arbitrary subjects.

Paper Structure

This paper contains 25 sections, 18 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Illustration of gaze orientation experiments. The results of two identities driven by the same audio clip and different gaze directions are presented. The true pitch and yaw angles are multiplied by $\pi$.
  • Figure 2: Illustration of our proposed GoHD, which is a highly disentangled and controllable talking face generation framework as described at the beginning of Section 3.
  • Figure 3: Demonstration of the residual denoising network architecture in the diffusion model for head pose estimation.
  • Figure 4: Definition of the eye motion feature, where $\boldsymbol{bl}_{t}$ represents the eye-blinking ratio of the $t$-th frame, with $\boldsymbol{he}_{t/r} \in \mathbb{R}$ denoting the average heights of eyes. $\boldsymbol{fr}_t \in \mathbb{R}^{20}$ symbolizes the corresponding brow displacements, and $flat$ means the operation of flattening. The landmark indices and calculation for ${\boldsymbol{he}}_{t/r}$ are illustrated on the right side.
  • Figure 5: The expression predictor trained in two stages.
  • ...and 8 more figures
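
The eye motion feature described in the Figure 4 caption (a blinking ratio, average eye heights, and a flattened 20-dimensional brow displacement) can be illustrated with a rough sketch. Everything specific below is an assumption rather than the paper's definition: the iBUG 68-point landmark indices, the blinking ratio taken as the current average eye height divided by that of a reference frame, and brow displacements measured against the same reference frame. The function names are hypothetical.

```python
from math import dist

# Assumed iBUG 68-point landmark layout (not confirmed by the source).
EYE_A = (36, 37, 38, 39, 40, 41)
EYE_B = (42, 43, 44, 45, 46, 47)
BROWS = tuple(range(17, 27))  # 10 brow points -> 20 values once flattened


def eye_height(landmarks, eye):
    """Average of the two upper-lid to lower-lid distances of one eye."""
    _, up1, up2, _, low2, low1 = (landmarks[i] for i in eye)
    return 0.5 * (dist(up1, low1) + dist(up2, low2))


def mean_eye_height(landmarks):
    """Mean eye height over both eyes."""
    return 0.5 * (eye_height(landmarks, EYE_A) + eye_height(landmarks, EYE_B))


def eye_motion_feature(landmarks, reference):
    """Return [bl_t] + flat(fr_t): 1 blink value plus 20 brow values.

    Assumption: bl_t is the eye height relative to the reference frame,
    and fr_t is the per-point brow offset from the reference frame.
    """
    ref_height = mean_eye_height(reference)
    bl = mean_eye_height(landmarks) / ref_height if ref_height > 0 else 0.0
    # Flatten (x, y) offsets of the 10 brow points into 20 scalars.
    fr = [landmarks[i][k] - reference[i][k] for i in BROWS for k in (0, 1)]
    return [bl] + fr
```

Under these assumptions each frame yields a 21-value vector: one blink value followed by the 20 flattened brow offsets. The paper's actual landmark indices and normalization are the ones illustrated in its Figure 4 and may differ from this sketch.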