LES-Talker: Fine-Grained Emotion Editing for Talking Head Generation in Linear Emotion Space

Guanwen Feng, Zhihao Qian, Yunan Li, Siyu Jin, Qiguang Miao, Chi-Man Pun

TL;DR

This work introduces Linear Emotion Space (LES), a 41-dimensional, interpretable representation built on Facial Action Units to enable fine-grained, continuous emotion editing in talking head generation. LES separates into an Action Subspace $\mathbb{A}$ and an Isolation Subspace $\mathbb{I}$ to encode both facial actions and subtle emotion details, with emotion level varying linearly within LES. A Cross-Dimension Attention Net (CDAN) learns high-dimensional correlations between LES and 3DMM coefficients to drive controllable 3D facial deformations, supported by an Emotion Injector and multi-modal adaptation (AU source or audio) for robust synthesis. Experimental results on MEAD and other datasets show high visual quality, precise multi-emotion editing across multiple levels, and strong interpretability compared to state-of-the-art baselines, highlighting the approach’s potential for transparent, controllable digital humans.

Abstract

While existing one-shot talking head generation models have made progress in coarse-grained emotion editing, fine-grained emotion editing models with high interpretability are still lacking. We argue that for an approach to be considered fine-grained, it needs to provide clear definitions and sufficiently detailed differentiation. We present LES-Talker, a novel one-shot talking head generation model with high interpretability, which achieves fine-grained emotion editing across emotion types, emotion levels, and facial units. We propose a Linear Emotion Space (LES), defined on Facial Action Units, that characterizes emotion transformations as vector transformations. We design the Cross-Dimension Attention Net (CDAN) to mine the correlation between the LES representation and the 3D model representation. By mining multiple relationships across different feature and structure dimensions, we enable the LES representation to guide the controllable deformation of the 3D model. To adapt multimodal data with deviations to the LES and to enhance visual quality, we use specialized network designs and training strategies. Experiments show that our method provides high visual quality along with multilevel, interpretable, fine-grained emotion editing, outperforming mainstream methods.
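
The idea of treating emotion edits as vector transformations can be illustrated with a small sketch. This is not the paper's implementation: the function names `scale_emotion` and `bias_unit`, the toy 4-dimensional vectors, and the direction values are all hypothetical. The sketch only assumes that an emotion level acts as a linear scale on an emotion direction and that a facial-unit bias acts as an additive offset on one component.

```python
from typing import List

Vector = List[float]


def scale_emotion(direction: Vector, level: float) -> Vector:
    """Return the emotion vector at a given level.

    Assumption for illustration: the level is a non-negative scalar
    that scales a fixed emotion direction linearly.
    """
    if level < 0:
        raise ValueError("level must be non-negative")
    return [level * x for x in direction]


def bias_unit(vec: Vector, index: int, bias: float) -> Vector:
    """Return a copy of vec with `bias` added to one facial-unit component."""
    out = list(vec)
    out[index] += bias
    return out


# Toy 4-dimensional stand-in for a LES vector (the real space is 41-dimensional).
happy_direction = [0.0, 0.5, 1.0, 0.25]

mild = scale_emotion(happy_direction, 0.5)      # weaker version of the emotion
strong = scale_emotion(happy_direction, 2.0)    # stronger version of the emotion
edited = bias_unit(strong, index=1, bias=0.5)   # adjust a single facial unit
```

Because both edits are plain vector operations, they compose in an obvious way: an emotion-level edit changes the whole vector proportionally, while a facial-unit edit changes one coordinate and leaves the rest untouched.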

Paper Structure

This paper contains 30 sections, 24 equations, 16 figures, 7 tables, 2 algorithms.

Figures (16)

  • Figure 1: Linear Emotion Space (LES) based on Facial Action Units (AUs) supports our LES-Talker model, offering exceptional interpretability. It enables fine-grained editing across 8 emotion types, 17 facial units, and continuous levels above 0. It can be driven by a lightweight use of video (requiring a sequence of images to provide the AU source) or by audio alone.
  • Figure 2: Pipeline of LES-Talker. Inputs include an identity image, audio, an optional AU Source, and user editing targets. The Linear Emotion Space Recon. generates emotion vectors in LES. The Emo Injector transforms these vectors based on user targets $(\text{emo}, \text{level})$ or $(\text{au}, \text{bias})$. Two levels of Cross-Dimension Attention Net (CDAN) process decomposed vectors $\boldsymbol{u}$ and $\boldsymbol{v}$ to produce 3D coefficients, optimized via Offset Decoder. These coefficients, along with identity information, create the rendered video.
  • Figure 3: Subspaces of Linear Emotion Space
  • Figure 4: Part of the Outlier Matrix.
  • Figure 5: Structure of the Cross-Dimension Attention Net, illustrated for the coefficients of a single frame. G denotes the vector outer product operation, and C denotes vector concatenation. $\boldsymbol{u}$ and $\boldsymbol{v}$ are vectors in the ACT and ISO Subspaces, respectively. $\beta$ denotes the 3DMM coefficients.
  • ...and 11 more figures
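
The two operations named in the Figure 5 caption, the vector outer product (G) and vector concatenation (C), can be shown on toy data. How CDAN actually arranges and attends over these quantities is not described in this section, so the flattening step, the helper names `outer` and `concat`, and all dimensions and values below are assumptions made purely for illustration.

```python
from typing import List


def outer(u: List[float], v: List[float]) -> List[List[float]]:
    """Vector outer product: result[i][j] = u[i] * v[j]."""
    return [[a * b for b in v] for a in u]


def concat(*vectors: List[float]) -> List[float]:
    """Concatenate any number of vectors into one flat vector."""
    out: List[float] = []
    for vec in vectors:
        out.extend(vec)
    return out


# Toy stand-ins: u and v play the role of the two subspace vectors,
# beta plays the role of per-frame 3DMM coefficients (sizes are made up).
u = [1.0, 2.0]
v = [0.5, 1.0, 2.0]
beta = [0.0, 0.25]

g = outer(u, v)                          # 2 x 3 matrix of pairwise products
flat = [x for row in g for x in row]     # row-major flattening (an assumption)
features = concat(flat, beta)            # one flat vector of length 2*3 + 2
```

The outer product exposes every pairwise interaction between components of the two vectors, which is one plausible reading of why the caption pairs it with concatenation.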