EmoHead: Emotional Talking Head via Manipulating Semantic Expression Parameters

Xuli Shen, Hua Cai, Dingding Yu, Weilin Shen, Qing Xu, Xiangyang Xue

TL;DR

EmoHead addresses the challenge of emotion-controlled talking head synthesis by introducing semantic expression parameters and an audio-expression module that maps multi-modal audio features to a low-dimensional expression parameter vector $\boldsymbol{\alpha} \in \mathbb{R}^m$. Emotion consistency is achieved through an Audio Expression Alignment mechanism that fuses audio, emotion, and text features, followed by emotion-specific hyperplanes that refine $\boldsymbol{\alpha}$ along emotion normals. The rendering stage uses a NeRF-based implicit function conditioned on the refined parameters $\hat{\boldsymbol{\alpha}}$ to produce high-fidelity frames, with two-stage training incorporating CLIP-based alignment and identity preservation losses. Experimental results on MEAD and CREMA-D show improved reconstruction quality, lip-sync accuracy, and emotion fidelity compared to baselines, which supports the use of semantic parameter disentanglement and hyperplane refinement for controllable emotional avatar synthesis.

Abstract

Generating emotion-specific talking head videos from audio input is an important and complex challenge for human-machine interaction. However, emotion is a highly abstract concept with ambiguous boundaries, and generating emotionally expressive talking head videos requires disentangled expression parameters. In this work, we present EmoHead to synthesize talking head videos via semantic expression parameters. To predict expression parameters for arbitrary audio input, we apply an audio-expression module that can be specified by an emotion tag. This module aims to enhance the correlation with the audio input across various emotions. Furthermore, we leverage pre-trained hyperplanes to refine facial movements by probing along the vertical (normal) direction of each hyperplane. Finally, the refined expression parameters regularize neural radiance fields and facilitate the emotion-consistent generation of talking head videos. Experimental results demonstrate that semantic expression parameters lead to better reconstruction quality and controllability.
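
The hyperplane refinement step described above can be illustrated with a minimal sketch. It assumes, as a simplification not spelled out in this summary, that each emotion has a hyperplane $\mathbf{n}^\top \boldsymbol{\alpha} + b = 0$ in expression space and that refinement shifts the parameters by a step $\lambda$ along the unit normal, i.e. $\hat{\boldsymbol{\alpha}} = \boldsymbol{\alpha} + \lambda \mathbf{n} / \lVert \mathbf{n} \rVert$. The function name `refine_along_normal` and the arguments `bias` and `step` are hypothetical and chosen only for illustration; the paper's actual refinement rule may differ.

```python
import numpy as np


def signed_distance(alpha, normal, bias=0.0):
    """Signed distance from alpha to the hyperplane normal . x + bias = 0."""
    alpha = np.asarray(alpha, dtype=float)
    normal = np.asarray(normal, dtype=float)
    norm = np.linalg.norm(normal)
    if norm == 0.0:
        raise ValueError("hyperplane normal must be non-zero")
    return float((alpha @ normal + bias) / norm)


def refine_along_normal(alpha, normal, step=1.0):
    """Shift alpha by `step` along the unit normal (assumed refinement rule)."""
    alpha = np.asarray(alpha, dtype=float)
    normal = np.asarray(normal, dtype=float)
    norm = np.linalg.norm(normal)
    if norm == 0.0:
        raise ValueError("hyperplane normal must be non-zero")
    return alpha + step * (normal / norm)


# Toy 3-dimensional example; real expression parameters live in R^m.
alpha = np.array([0.0, 0.0, 0.0])
emotion_normal = np.array([3.0, 4.0, 0.0])  # hypothetical emotion hyperplane normal

alpha_hat = refine_along_normal(alpha, emotion_normal, step=2.0)
before = signed_distance(alpha, emotion_normal)
after = signed_distance(alpha_hat, emotion_normal)
```

Under this assumption, the signed distance to the hyperplane grows by exactly `step`, so a larger step pushes the expression further toward the target-emotion side, and a negative step moves it away.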

Paper Structure

This paper contains 20 sections, 10 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Unlike previous work, the proposed method applies emotion-specific hyperplanes to eliminate the "expression collapse" phenomenon and generate videos with the target emotion.
  • Figure 2: The proposed framework of EmoHead.
  • Figure 3: Visual comparison of generated frames with state-of-the-art methods. See the explanation of the colors in Sec. \ref{quaa}.
  • Figure 4: (A) Continuous expression manipulation during the talking stage. (B) Reconstruction results for unseen views and varying emotions. Please refer to the supplementary video.
  • Figure 5: (A) Monocular reconstruction obtained by changing the value of each dimension of the 3DMM expression parameters. Emotions can hardly be reconstructed using facial expression parameters. Most dimensions represent "unclear" emotions, as depicted in the green boxes. (B) Continuous expression change along a single dimension over the range $[-1.8, 1.8]$.
  • ...and 6 more figures