Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation

Tianshui Chen, Jianman Lin, Zhijing Yang, Chumei Qing, Yukai Shi, Liang Lin

TL;DR

This work tackles the challenge of speech-preserving facial expression manipulation by decoupling content and emotion information. It introduces Contrastive Decoupled Representation Learning (CDRL), comprising CCRL for content priors from audio and CERL for emotion priors from a visual-language model, both guided by specialized contrastive losses. The decoupled representations serve as direct supervision signals during SPFEM training, leading to improved audio-lip synchronization and more accurate emotion manipulation, demonstrated on MEAD and RAVDESS datasets, with additional gains when combined with ASCCL. The approach offers robust generalization and provides a principled framework for leveraging cross-modal priors to enhance photorealistic facial editing while maintaining content integrity.

Abstract

Speech-preserving facial expression manipulation (SPFEM) aims to modify a talking head so that it displays a specific reference emotion while preserving the mouth animation of the source spoken content. The emotion and content information in the reference and source inputs can therefore provide direct and accurate supervision signals for SPFEM models. However, these two elements are intrinsically intertwined during talking, which limits their effectiveness as supervisory signals. In this work, we propose to learn content and emotion priors as guidance, augmented with contrastive learning, to obtain decoupled content and emotion representations via a Contrastive Decoupled Representation Learning (CDRL) algorithm. Specifically, a Contrastive Content Representation Learning (CCRL) module is designed to learn audio features, which primarily contain content information, as content priors that guide the learning of a content representation from the source input. Meanwhile, a Contrastive Emotion Representation Learning (CERL) module uses a pre-trained visual-language model to learn an emotion prior, which then guides the learning of an emotion representation from the reference input. We further introduce emotion-aware and emotion-augmented contrastive learning to train the CCRL and CERL modules, respectively, so that the content representation is emotion-independent and the emotion representation is content-independent. During SPFEM model training, the decoupled content and emotion representations are used to supervise the generation process, ensuring more accurate emotion manipulation together with audio-lip synchronization. Extensive experiments and evaluations on various benchmarks show the effectiveness of the proposed algorithm.
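The contrastive training described in the abstract can be illustrated with a generic InfoNCE-style loss. The sketch below is not the authors' emotion-aware or emotion-augmented loss; it only shows the general idea under the assumption that, for content learning, frames with the same spoken content but different emotions act as positives and frames with different content act as negatives. The function names, the temperature value, and the toy feature vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positives, negatives, temperature=0.1):
    """Generic InfoNCE loss (an assumption, not the paper's exact loss).

    Pulls the anchor toward every positive and pushes it away from
    every negative; returns the mean loss over the positives.
    """
    pos = [math.exp(cosine(anchor, p) / temperature) for p in positives]
    neg = [math.exp(cosine(anchor, n) / temperature) for n in negatives]
    denom = sum(pos) + sum(neg)
    return -sum(math.log(p / denom) for p in pos) / len(pos)


# Hypothetical toy content features.
anchor = [1.0, 0.0, 0.2]                      # sentence A, emotion 1
same_content = [[0.9, 0.1, 0.1]]              # sentence A, emotion 2
other_content = [[0.0, 1.0, 0.3], [-0.5, 0.8, 0.0]]  # other sentences

# Well-decoupled case: the positive really is close to the anchor.
loss_good = info_nce(anchor, same_content, other_content)

# Mislabelled case: a different-content frame is treated as the positive.
loss_bad = info_nce(anchor, other_content[:1], same_content + other_content[1:])

print(f"loss_good={loss_good:.4f}, loss_bad={loss_bad:.4f}")
```

With these toy vectors the loss is small when same-content features already cluster together and large when they do not, which is the pressure that would encourage an emotion-independent content representation.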

Paper Structure

This paper contains 21 sections, 6 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: The overall pipeline for incorporating the proposed CDRL algorithm to supervise the learning of SPFEM models. It consists of the CCRL and CERL modules. CCRL uses the audio corresponding to the source input ($I_s$) as a content prior to decouple the content representation from both the source input ($I_s$) and the generated output ($I_g$), ensuring aligned content generation. CERL employs the learned emotion prior to decouple emotion from the reference input ($I_r$) and the generated output ($I_g$), facilitating consistent emotion generation.
  • Figure 2: An illustration of the CCRL module. It uses the audio clip to guide learning of the content representation through a cross-attention mechanism trained with an emotion-aware contrastive loss. Here, the image encoder $\Phi(\cdot)$ combines the pretrained ArcFace encoder $E_I(\cdot)$ (Deng et al., 2019) and the mapping operation $M(\cdot)$, while $E_c(\cdot)$ consists of $\Phi(\cdot)$ and a cross-attention mechanism.
  • Figure 3: An illustration of CERL module. It uses a pre-trained visual-language model with prompt tuning to learn emotion priors and exploits the priors to guide learning emotion representation with a simple correlation operation supervised by an emotion-augmented contrastive loss. $E_e(\cdot)$ includes image feature extraction and a dot product with the emotion prior.
  • Figure 4: Qualitative comparisons of NED with and without the proposed algorithm. Left half: The samples are selected from the MEAD dataset. Right half: The samples are selected from the RAVDESS dataset.
  • Figure 5: Qualitative comparisons of ICface with and without the proposed algorithm. Left half: The samples are selected from the MEAD dataset. Right half: The samples are selected from the MEAD dataset.
  • ...and 5 more figures
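
The two mechanisms named in the captions of Figures 2 and 3, cross-attention in CCRL and a dot product with the emotion prior in CERL, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes audio features act as queries over image features (the captions do not say which side is the query), omits learned projection matrices, and uses hypothetical names and toy vectors throughout.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections.

    Each query attends over all keys and returns a weighted sum of values.
    """
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def emotion_scores(image_feature, emotion_priors):
    """Correlate an image feature with each emotion prior via a dot product."""
    return {name: dot(image_feature, prior)
            for name, prior in emotion_priors.items()}


# Hypothetical toy inputs: audio features query the image features (assumption).
audio_features = [[1.0, 0.0], [0.0, 1.0]]
image_features = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
content = cross_attention(audio_features, image_features, image_features)

# Hypothetical emotion priors; in the paper these come from a prompt-tuned
# visual-language model, here they are hand-written vectors.
priors = {"happy": [1.0, 0.0], "sad": [-1.0, 0.0]}
scores = emotion_scores([0.8, 0.1], priors)
best = max(scores, key=scores.get)

print(content, best)
```

In this toy setup the first audio query places most of its attention on the image feature it aligns with, and the emotion score is highest for the prior that points in the same direction as the image feature.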