UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts

Zhi-Qi Cheng, Xiang Li, Jun-Yan He, Junyao Chen, Xiaomao Fan, Xiaojiang Peng, Alexander G. Hauptmann

TL;DR

UMETTS addresses the difficulty of conveying emotional nuance in TTS by drawing on multimodal cues from text, audio, and visual inputs. It introduces two key components: EP-Align, a contrastive learning framework that aligns multimodal emotional representations into a unified embedding, and EMI-TTS, which conditions TTS models on these embeddings. The framework supports multiple TTS backends (VITS, FastSpeech2, Tacotron2) and reports improved objective and subjective results on the MELD, MEAD, ESD, and RAF-DB datasets, measured by WER/CER, MCD, SECS, and MOS. These results suggest practical value for emotionally rich speech in HCI, entertainment, education, and assistive systems, and the code is released as open source to support further research.

Abstract

Emotional Text-to-Speech (E-TTS) synthesis has garnered significant attention in recent years due to its potential to revolutionize human-computer interaction. However, current E-TTS approaches often struggle to capture the intricacies of human emotions, primarily relying on oversimplified emotional labels or single-modality input. In this paper, we introduce the Unified Multimodal Prompt-Induced Emotional Text-to-Speech System (UMETTS), a novel framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. The core of UMETTS consists of two key components: the Emotion Prompt Alignment Module (EP-Align) and the Emotion Embedding-Induced TTS Module (EMI-TTS). (1) EP-Align employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information. (2) Subsequently, EMI-TTS integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions. Extensive evaluations show that UMETTS achieves significant improvements in emotion accuracy and speech naturalness, outperforming traditional E-TTS methods on both objective and subjective metrics.
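
To make the idea of contrastive alignment concrete, the following is a minimal sketch of a symmetric, CLIP-style contrastive loss between two modalities. It is an illustration only, not the authors' EP-Align objective: the function names, the temperature value of 0.07, the embedding sizes, and the random toy data are all assumptions introduced here. Matched pairs (row i of one modality with row i of the other) are pulled together, and mismatched pairs are pushed apart.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length (falls back to 1.0 for a zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Hypothetical helper: average of the A-to-B and B-to-A contrastive losses.

    emb_a[i] and emb_b[i] are assumed to describe the same sample
    (for example, a text prompt and its paired audio clip).
    """
    a = [normalize(v) for v in emb_a]
    b = [normalize(v) for v in emb_b]
    # Cosine similarity of every A row against every B row, scaled by temperature.
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


# Toy data (assumed): 8 samples with 16-dimensional embeddings per modality.
random.seed(0)
text_emb = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
audio_aligned = [[x + 0.01 * random.gauss(0, 1) for x in v] for v in text_emb]
audio_random = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]

aligned_loss = symmetric_contrastive_loss(text_emb, audio_aligned)
random_loss = symmetric_contrastive_loss(text_emb, audio_random)
print("aligned:", aligned_loss, "unaligned:", random_loss)
```

In this sketch the loss is low when paired embeddings from the two modalities nearly coincide and higher when they are unrelated, which is the property a contrastive alignment module is trained to exploit. A three-modality setup such as text, audio, and visual would typically sum such a loss over modality pairs or against a shared anchor; which variant UMETTS uses is not stated in the abstract.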

Paper Structure

This paper contains 18 sections, 3 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Top: The UMETTS framework synthesizes speech by incorporating emotional cues from multiple modalities, so that the output speech consistently conveys the intended emotion. Example multimodal references are available at the following links: reference image (https://github.com/KTTRCDL/UMETTS/tree/main/demo/static/demo_image/reference_image.png), reference video (https://github.com/KTTRCDL/UMETTS/tree/main/demo/static/demo_video/reference_video.mp4), reference audio (https://github.com/KTTRCDL/UMETTS/tree/main/demo/static/demo_audio/reference_audio.wav), and demo audio (https://github.com/KTTRCDL/UMETTS/tree/main/demo/static/demo_audio/demo.wav). Bottom: Emotional speech synthesized by a style transfer model.
  • Figure 2: Overview of the UMETTS framework. UMETTS consists of two components: 1) Multimodal Emotional Prompt Alignment (EP-Align) and 2) Emotion Embedding-Induced TTS (EMI-TTS). EP-Align aligns emotional representations across modalities and supplies EMI-TTS with the multimodal emotional information that guides audio synthesis.
  • Figure 3: Left: Confusion matrix of MELD multimodal emotion alignment with EP-Align. Right: Samples of RAF-DB compound-emotion images aligned with EP-Align.