MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation
Huaize Liu, Wenzhang Sun, Donglin Di, Shibo Sun, Jiahui Yang, Changqing Zou, Hujun Bao
TL;DR
This paper tackles the challenge of generating realistic talking heads with rich, controllable emotions from audio. It introduces MoEE, a Mixture of Emotion Experts architecture that decouples six basic emotions into dedicated experts and uses soft global-local gating to synthesize both single and compound emotions. MoEE is complemented by the DH-FaceEmoVid-150 dataset and an emotion-to-latents module that maps multimodal controls (audio, text, labels) into a shared emotion latent. The approach is implemented in a diffusion-based framework with a Reference Net and employs a two-stage training regime, masked noisy emotion sampling, and AU-aware fine-grained control. It reports state-of-the-art results on HDTF, MEAD, and DH-FaceEmoVid-150 across multiple perceptual and synchronization metrics. The work demonstrates strong lip-sync, naturalness, and emotion controllability, and contributes a high-quality dataset, slated for public release, to advance emotion-driven avatar generation.
Abstract
The generation of talking avatars has achieved significant advancements in precise audio synchronization. However, crafting lifelike talking head videos requires capturing a broad spectrum of emotions and subtle facial expressions. Current methods face two fundamental challenges: a) the absence of frameworks for modeling single basic emotional expressions, which restricts the generation of complex emotions such as compound emotions; b) the lack of comprehensive datasets rich in human emotional expressions, which limits the potential of models. To address these challenges, we propose the following innovations: 1) the Mixture of Emotion Experts (MoEE) model, which decouples six fundamental emotions to enable the precise synthesis of both singular and compound emotional states; 2) the DH-FaceEmoVid-150 dataset, specifically curated to include six prevalent human emotional expressions as well as four types of compound emotions, thereby expanding the training potential of emotion-driven models. Furthermore, to enhance the flexibility of emotion control, we propose an emotion-to-latents module that aligns diverse control signals (such as audio, text, and labels), supporting more varied control inputs as well as emotion control from audio alone. Through extensive quantitative and qualitative evaluations, we demonstrate that the MoEE framework, in conjunction with the DH-FaceEmoVid-150 dataset, excels in generating complex emotional expressions and nuanced facial details, setting a new benchmark in the field. The dataset will be publicly released.
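
To make the idea of softly mixing per-emotion experts concrete, the following is a minimal sketch and not the authors' implementation. It assumes the six basic emotions are Ekman's set (the abstract does not list them), replaces each learned expert with a hypothetical fixed scaling function, and ignores the global-local structure of the gate. It only illustrates how softmax gate weights can blend experts so that a compound emotion draws on more than one of them.

```python
import math

# Assumed label set (Ekman's six basic emotions); the abstract does not list them.
EMOTIONS = ["happy", "sad", "angry", "fear", "disgust", "surprise"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(index):
    """Hypothetical stand-in for a learned expert: scales the input by a fixed factor."""
    scale = float(index + 1)
    return lambda features: [scale * f for f in features]


EXPERTS = [make_expert(i) for i in range(len(EMOTIONS))]


def mix_experts(features, gate_logits):
    """Soft gating: weight every expert's output by its softmax gate value and sum."""
    weights = softmax(gate_logits)
    outputs = [expert(features) for expert in EXPERTS]
    mixed = [
        sum(w * out[d] for w, out in zip(weights, outputs))
        for d in range(len(features))
    ]
    return weights, mixed


# Compound emotion: equally high logits on "happy" and "surprise", low elsewhere.
compound_logits = [4.0 if name in ("happy", "surprise") else -4.0 for name in EMOTIONS]
weights, mixed = mix_experts([1.0, 2.0], compound_logits)
```

Because the gate is soft, every expert contributes a little. In this example the "happy" and "surprise" experts dominate with equal weight, which is the behavior a compound emotion would need; a single emotion corresponds to one dominant logit.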
