MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation

Huaize Liu, Wenzhang Sun, Donglin Di, Shibo Sun, Jiahui Yang, Changqing Zou, Hujun Bao

TL;DR

This paper tackles the challenge of generating realistic talking heads with rich, controllable emotions from audio. It introduces MoEE, a Mixture of Emotion Experts architecture that decouples six basic emotions into dedicated experts and uses soft global-local gating to synthesize both single and compound emotions; it is complemented by the DH-FaceEmoVid-150 dataset and an emotion-to-latents module that unifies multimodal controls (audio, text, labels) into a shared emotion latent. The approach is implemented in a diffusion-based framework with a Reference Net, and employs a two-stage training regime, masked noisy emotion sampling, and AU-aware fine-grained control, achieving state-of-the-art results across HDTF, MEAD, and DH-FaceEmoVid-150 on multiple perceptual and synchronization metrics. The work demonstrates strong lip-sync, naturalness, and emotion controllability, and contributes a publicly releasable high-quality dataset to advance emotion-driven avatar generation systems.

Abstract

The generation of talking avatars has achieved significant advancements in precise audio synchronization. However, crafting lifelike talking head videos requires capturing a broad spectrum of emotions and subtle facial expressions. Current methods face fundamental challenges: a) the absence of frameworks for modeling single basic emotional expressions, which restricts the generation of complex emotions such as compound emotions; b) the lack of comprehensive datasets rich in human emotional expressions, which limits the potential of models. To address these challenges, we propose the following innovations: 1) the Mixture of Emotion Experts (MoEE) model, which decouples six fundamental emotions to enable the precise synthesis of both singular and compound emotional states; 2) the DH-FaceEmoVid-150 dataset, specifically curated to include six prevalent human emotional expressions as well as four types of compound emotions, thereby expanding the training potential of emotion-driven models. Furthermore, to enhance the flexibility of emotion control, we propose an emotion-to-latents module that leverages multimodal inputs, aligning diverse control signals, such as audio, text, and labels, to ensure more varied control inputs as well as the ability to control emotions using audio alone. Through extensive quantitative and qualitative evaluations, we demonstrate that the MoEE framework, in conjunction with the DH-FaceEmoVid-150 dataset, excels in generating complex emotional expressions and nuanced facial details, setting a new benchmark in the field. These datasets will be publicly released.
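
The idea of blending dedicated emotion experts through soft gating can be illustrated with a toy sketch. This is not the authors' implementation: the emotion names, the `softmax` and `mix_experts` helpers, the one-hot "expert outputs", and the gate scores are all assumptions made for illustration. It shows only a single global gate over whole feature vectors and does not attempt to model the global-local gating or the diffusion backbone.

```python
import math

# Assumed names for the six basic emotions; the summary above does not list them.
EMOTIONS = ["happy", "sad", "angry", "disgusted", "fearful", "surprised"]


def softmax(scores):
    """Turn raw gate scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_outputs, gate_scores):
    """Blend per-emotion expert feature vectors with soft gating weights.

    expert_outputs: one feature vector (list of floats) per expert.
    gate_scores:    one raw score per expert (hypothetical gate output).
    """
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    mixed = [0.0] * dim
    for weight, output in zip(weights, expert_outputs):
        for i, value in enumerate(output):
            mixed[i] += weight * value
    return weights, mixed


# Toy experts: each emits a one-hot vector, so the blend equals the gate weights.
expert_outputs = [
    [1.0 if i == j else 0.0 for i in range(len(EMOTIONS))]
    for j in range(len(EMOTIONS))
]

# Hypothetical gate scores favouring "happy" and "surprised" equally,
# standing in for a compound emotion built from two basic ones.
gate_scores = [2.0, 0.0, 0.0, 0.0, 0.0, 2.0]
weights, mixed = mix_experts(expert_outputs, gate_scores)

for name, weight in zip(EMOTIONS, weights):
    print(f"{name:>10}: {weight:.3f}")
```

Because the weights are soft rather than one-hot, every expert contributes a little, and a compound emotion is simply a gate that puts most of its mass on two experts instead of one.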

Paper Structure (20 sections, 7 equations, 15 figures, 7 tables)

Figures (15)

  • Figure 1: MoEE enables more natural and vivid basic emotion control and compound emotion control, through labels or text prompts in the generated talking face. It can also achieve emotion control based solely on audio with emotional cues. Beyond coarse-grained emotion control (e.g., audio, label and text prompt), our method allows for fine-grained expression control through AU labels.
  • Figure 2: Showcase of publicly available datasets and our proposed datasets: We refer to datasets like (a) DH-FaceEmoVid-150 and (b) MEAD as emotion datasets.
  • Figure 3: First, we filter the emotion dataset based on emotion intensity. Then, to achieve fine-grained control, we extract the AUs and prompt GPT-4V to paraphrase them into a sentence.
  • Figure 4: The overall framework of MoEE implements a two-stage training process. First, we fine-tune the Reference Net and denoising U-Net modules on emotion datasets to enable the model to learn as much prior knowledge about expressive faces as possible. Then, we achieve more natural and accurate emotion and expression control through the Mixture of Emotion Experts. Additionally, the Emotion-to-Latents module enables multi-modal emotion control.
  • Figure 5: Visualization of different emotion sampling strategies. The results demonstrate that the proposed masked noisy emotion sampling strategy ensures natural and vivid expressions.
  • ...and 10 more figures