EmoGene: Audio-Driven Emotional 3D Talking-Head Generation

Wenqing Wang, Yun Fu

TL;DR

EmoGene is a novel framework for synthesizing high-fidelity, audio-driven video portraits with accurate emotional expressions. The authors report that it outperforms previous methods in generating high-fidelity emotional talking-head videos.

Abstract

Audio-driven talking-head generation is an important technology for virtual human interaction and film-making. While recent advances have focused on improving image fidelity and lip synchronization, generating accurate emotional expressions remains underexplored. In this paper, we introduce EmoGene, a novel framework for synthesizing high-fidelity, audio-driven video portraits with accurate emotional expressions. Our approach employs a variational autoencoder (VAE)-based audio-to-motion module to generate facial landmarks, which are concatenated with an emotion embedding in a motion-to-emotion module to produce emotional landmarks. These landmarks drive a Neural Radiance Fields (NeRF)-based emotion-to-video module that renders realistic emotional talking-head videos. Additionally, we propose a pose sampling method to generate natural idle-state (non-speaking) videos for silent audio inputs. Extensive experiments demonstrate that EmoGene outperforms previous methods in generating high-fidelity emotional talking-head videos.
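
The first two stages of the data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the landmark dimensionality, the emotion table, and the placeholder arithmetic in `audio_to_motion` are all assumptions, chosen only to show how a per-frame landmark vector and an emotion embedding might be concatenated. The learned networks and the NeRF renderer are omitted.

```python
from typing import Dict, List

Frame = List[float]

# Hypothetical emotion embeddings; the real embedding size and values
# are not given in this excerpt.
EMOTION_TABLE: Dict[str, Frame] = {
    "neutral": [0.0, 0.0],
    "happy": [1.0, 0.0],
    "sad": [0.0, 1.0],
}


def audio_to_motion(audio_features: List[Frame], landmark_dim: int = 6) -> List[Frame]:
    """Stand-in for the VAE-based module: one neutral landmark vector per audio frame.

    Placeholder logic only: each landmark coordinate is the mean of the
    audio frame. A trained model would decode landmarks from a latent code.
    """
    return [[sum(f) / len(f)] * landmark_dim for f in audio_features]


def concat_emotion(landmarks: List[Frame], emotion: str) -> List[Frame]:
    """Append the same emotion embedding to every landmark frame."""
    embedding = EMOTION_TABLE[emotion]
    return [frame + embedding for frame in landmarks]


# Three frames of (assumed) 4-dimensional audio features.
audio = [[0.1, 0.2, 0.3, 0.4], [0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
neutral_landmarks = audio_to_motion(audio)
fused = concat_emotion(neutral_landmarks, "happy")
# In the described pipeline, `fused` would feed a learned motion-to-emotion
# network, whose emotional landmarks then condition the NeRF renderer.
```

Each fused frame here has `landmark_dim + 2` values, since the assumed embedding is two-dimensional. Concatenating along the feature axis keeps the frame count unchanged, so the emotional landmarks stay aligned with the audio frames.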

Paper Structure

This paper contains 16 sections, 10 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: EmoGene pipeline. 1) The Audio-to-Motion module converts input audio features into neutral facial landmarks. 2) The Motion-to-Emotion module transforms these landmarks into emotional landmarks based on the emotion label. 3) The Emotion-to-Video module generates the emotional talking-head video conditioned on the emotional landmarks.
  • Figure 2: The overview of the audio-to-motion module. The dashed arrow indicates that the process occurs only during training.
  • Figure 3: The overview of the emotion-to-video module. The NeRF models are trained to render the talking-head video from the driving landmarks.
  • Figure 4: The overview of the motion-to-emotion module. The dashed arrow indicates that the process occurs only during training.
  • Figure 5: New pose tensor construction. To construct the new pose tensor, the idle pose segments are inserted after their corresponding non-idle pose tensors.
  • ...and 3 more figures
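
The pose tensor construction summarized in the Figure 5 caption can be illustrated with a short sketch. This excerpt does not describe how idle poses are sampled or how segments are paired, so the code assumes exactly one (possibly empty) idle segment per non-idle segment. The function name `build_pose_sequence` and the list-of-lists representation are hypothetical.

```python
from typing import List, Sequence

Pose = List[float]


def build_pose_sequence(
    non_idle: Sequence[List[Pose]],
    idle: Sequence[List[Pose]],
) -> List[Pose]:
    """Insert each idle pose segment directly after its non-idle segment.

    Assumes idle[i] corresponds to non_idle[i]; an empty idle segment
    means no silence follows that stretch of speech.
    """
    if len(non_idle) != len(idle):
        raise ValueError("expected one idle segment per non-idle segment")
    sequence: List[Pose] = []
    for speaking, silent in zip(non_idle, idle):
        sequence.extend(speaking)  # poses used while the subject speaks
        sequence.extend(silent)    # idle poses for the following silence
    return sequence


# Two speaking segments; only the first is followed by an idle segment.
speaking_segments = [[[1.0], [2.0]], [[5.0]]]
idle_segments = [[[3.0]], []]
poses = build_pose_sequence(speaking_segments, idle_segments)
```

In this toy example the resulting order is the first speaking segment, then its idle pose, then the second speaking segment. A tensor-based version would concatenate along the time axis in the same order.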