EmoHead: Emotional Talking Head via Manipulating Semantic Expression Parameters
Xuli Shen, Hua Cai, Dingding Yu, Weilin Shen, Qing Xu, Xiangyang Xue
TL;DR
EmoHead addresses the challenge of emotion-controlled talking head synthesis by introducing semantic expression parameters and an audio-expression module that maps multi-modal audio features to a low-dimensional expression space $\boldsymbol{\alpha} \in \mathbb{R}^m$. Emotion consistency is achieved through an Audio Expression Alignment mechanism that fuses audio, emotion, and text features, followed by emotion-specific hyperplanes that refine $\boldsymbol{\alpha}$ along emotion normals. The rendering stage uses a NeRF-based implicit function conditioned on the refined parameters $\hat{\boldsymbol{\alpha}}$ to produce high-fidelity frames, with two-stage training incorporating CLIP-based alignment and identity preservation losses. Experimental results on MEAD and CREMA-D show improved reconstruction quality, lip-sync accuracy, and emotion fidelity compared to baselines, demonstrating the practical impact of semantic parameter disentanglement and hyperplane refinement for controllable emotional avatar synthesis.
Abstract
Generating emotion-specific talking head videos from audio input is an important and complex challenge for human-machine interaction. However, emotion is a highly abstract concept with ambiguous boundaries, so generating emotionally expressive talking head videos requires disentangled expression parameters. In this work, we present EmoHead, which synthesizes talking head videos via semantic expression parameters. To predict expression parameters for arbitrary audio input, we apply an audio-expression module that can be conditioned on an emotion tag. This module aims to strengthen the correlation between the audio input and the predicted expression across various emotions. Furthermore, we leverage pre-trained hyperplanes to refine facial movements by probing along the direction normal to each hyperplane. Finally, the refined expression parameters regularize neural radiance fields and facilitate the emotion-consistent generation of talking head videos. Experimental results demonstrate that semantic expression parameters lead to better reconstruction quality and controllability.
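
The hyperplane refinement step can be illustrated with a minimal sketch. It assumes the common linear form in which an expression vector $\boldsymbol{\alpha}$ is shifted along the unit normal of an emotion hyperplane $\mathbf{w}^\top \boldsymbol{\alpha} + b = 0$; the exact update rule is not given in the text above, so the function name `refine_expression`, the `bias` term, and the `step` size are hypothetical choices made for illustration only.

```python
import math


def refine_expression(alpha, normal, bias=0.0, step=1.0):
    """Shift an expression vector along the unit normal of a hyperplane.

    Assumed (not taken from the paper): the hyperplane is
    normal . x + bias = 0 and the refinement is a straight-line move
    of length `step` in the normal direction.
    """
    if len(alpha) != len(normal):
        raise ValueError("alpha and normal must have the same length")
    norm = math.sqrt(sum(w * w for w in normal))
    if norm == 0.0:
        raise ValueError("normal must be a non-zero vector")

    # Unit normal of the hyperplane.
    unit = [w / norm for w in normal]

    # Signed distance of the ORIGINAL alpha from the hyperplane.
    distance = (sum(a * w for a, w in zip(alpha, normal)) + bias) / norm

    # Refined parameters: move `step` units along the normal.
    refined = [a + step * u for a, u in zip(alpha, unit)]
    return refined, distance


if __name__ == "__main__":
    alpha = [1.0, 2.0, 0.0]    # toy 3-dimensional expression vector
    normal = [0.0, 0.0, 2.0]   # toy emotion normal (not unit length)
    refined, distance = refine_expression(alpha, normal, step=0.5)
    print(refined, distance)
```

Under these assumptions, a positive `step` moves the parameters toward the side of the hyperplane the normal points to, and a negative `step` moves them away from it; the returned signed distance indicates which side the input started on.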
