Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech

Yunji Chu, Yunseob Shim, Unsang Park

TL;DR

FEIM-TTS addresses the challenge of producing emotionally expressive, facially aligned speech in a zero-shot setting by integrating facial cues and emotion intensity into a diffusion-based TTS framework. It extends FACE-TTS with classifier-free diffusion guidance to condition generation on emotion labels and facial imagery, while enabling emotion-intensity control during sampling. Trained on CREMA-D, MELD, and LRS3, FEIM-TTS demonstrates controllable emotion expression and high naturalness, with MOS and SER evaluations supporting its effectiveness. The work highlights potential benefits for virtual characters and web content accessibility, including for visually impaired users, and provides samples at the project site.
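To make the intensity control concrete, the following is a minimal sketch of how classifier-free guidance is commonly applied at a single sampling step. It is not taken from the FEIM-TTS code: the function name `guided_score`, the toy vectors, and the interpolation form `uncond + intensity * (cond - uncond)` are assumptions based on the standard formulation, and the paper's exact parameterization may differ.

```python
from typing import List, Sequence


def guided_score(
    cond: Sequence[float],
    uncond: Sequence[float],
    intensity: float,
) -> List[float]:
    """Blend conditional and unconditional predictions (classifier-free guidance).

    intensity = 0 returns the unconditional prediction, 1 returns the
    conditional one, and values above 1 push further toward the condition.
    """
    if len(cond) != len(uncond):
        raise ValueError("predictions must have the same length")
    return [u + intensity * (c - u) for c, u in zip(cond, uncond)]


# Toy values standing in for network outputs at one step (hypothetical).
cond = [1.0, 2.0, 3.0]      # prediction with the emotion label supplied
uncond = [0.5, 1.0, 1.5]    # prediction with the emotion label dropped

neutral = guided_score(cond, uncond, 0.0)
plain = guided_score(cond, uncond, 1.0)
strong = guided_score(cond, uncond, 2.0)
```

In this sketch the single scalar `intensity` plays the role of the guidance scale, which is why it can be changed at sampling time without retraining.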

Abstract

We propose FEIM-TTS, a zero-shot text-to-speech (TTS) model that synthesizes emotionally expressive speech aligned with facial images and modulated by emotion intensity. Leveraging deep learning, FEIM-TTS goes beyond traditional TTS systems by interpreting facial cues and adjusting to emotional nuances without dependence on labeled datasets. To address the scarcity of audio-visual-emotional data, the model is trained on the LRS3, CREMA-D, and MELD datasets, demonstrating its adaptability. Its ability to produce high-quality, speaker-agnostic speech makes it suitable for creating adaptable voices for virtual characters. FEIM-TTS also improves accessibility for individuals with visual impairments. By integrating emotional nuances into TTS, the model enables dynamic and engaging auditory experiences for webcomics, allowing visually impaired users to enjoy these narratives more fully. Evaluation shows that the model modulates emotion and intensity effectively, advancing emotional speech synthesis and accessibility. Samples are available at: https://feim-tts.github.io/.

Paper Structure

This paper contains 18 sections, 6 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: FEIM-TTS Architecture. Leveraging an emotion label, a corresponding facial image, and a textual transcription, our model adeptly synthesizes a mel-spectrogram for expressive speech synthesis. FEIM-TTS, except for the audio network, is trained end-to-end using CREMA-D, MELD, and LRS3 data.
  • Figure 2: Graph illustrating the relationship between emotion intensity and the class prediction probability of the SER model. This graph shows that lower emotion intensity corresponds to lower prediction probabilities, while higher intensity results in higher probabilities, confirming effective emotion intensity modulation in the synthesized speech of FEIM-TTS.
  • Figure 3: Results of the face-speech matching experiments. Participants completed two tasks: (A) given a reference face image and speech generated from it by both FEIM-TTS and FACE-TTS, select which speech aligns better with the face; and (B) given speech generated by FEIM-TTS from one of two face images, select which of the two faces best matches the speech.