FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles

Tian-Hao Zhang; Jiawei Zhang; Jun Wang; Xinyuan Qian; Xu-Cheng Yin

TL;DR

FaceSpeak tackles portrait-driven expressive TTS by learning disentangled identity and emotion cues from portraits of diverse styles and feeding them into a VITS2-based TTS backbone. It introduces EMTTS, a large multi-style, multi-modal dataset assembled via a collaborative pipeline with ChatGPT, PhotoMaker, and DALL-E-3 to enable robust cross-style synthesis. The method combines FaRL-based visual features with IAM/EAM-style disentanglement and mutual-information decoupling (vCLUB) to produce portrait-aligned speech, with the losses $L_{vits}$, $L_{mi}$, $L_{emo}$, and $L_{grl}$ guiding training. Experiments show that FaceSpeak achieves high naturalness, strong identity/emotion alignment, and effective cross-style control, including mixing identity and emotion cues from separate images and performing well on out-of-domain portraits.
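
As a rough sketch of how the four training terms might fit together, one plausible reading is a weighted sum. The weights $\lambda_{mi}$, $\lambda_{emo}$, and $\lambda_{grl}$ are hypothetical placeholders introduced here for illustration only; this summary does not state how the terms are weighted or combined:

$$L_{total} = L_{vits} + \lambda_{mi} L_{mi} + \lambda_{emo} L_{emo} + \lambda_{grl} L_{grl}$$

Under this assumed form, $L_{vits}$ would be the synthesis objective of the VITS2-based backbone, $L_{mi}$ would presumably be the vCLUB mutual-information term that discourages overlap between the identity and emotion representations, $L_{emo}$ would supervise the emotion cue, and $L_{grl}$ would presumably be an adversarial term applied through gradient reversal.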

Abstract

Humans can perceive a speaker's characteristics (e.g., identity, gender, personality, and emotion) from their appearance, and these characteristics are generally aligned with their voice style. Recent vision-driven text-to-speech (TTS) research has grounded its investigations on real-person faces, which restricts effective speech synthesis from being applied to the vast range of potential usage scenarios with diverse characters and image styles. To solve this issue, we introduce a novel approach, FaceSpeak. It extracts salient identity characteristics and emotional representations from a wide variety of image styles while mitigating extraneous information (e.g., background, clothing, and hair color), so that the synthesized speech is closely aligned with a character's persona. Furthermore, to overcome the scarcity of multi-modal TTS data, we have devised a new dataset, Expressive Multi-Modal TTS (EMTTS), which is carefully curated and annotated to facilitate research in this domain. Experimental results demonstrate that FaceSpeak can generate portrait-aligned voices with satisfactory naturalness and quality.

Paper Structure (15 sections, 6 equations, 6 figures, 4 tables)

Introduction
Related Work
Proposed EMTTS Dataset
EMTTS-MEAD
EMTTS-ESD-EmovDB
Proposed Method
Multi-Style Image Feature Disentanglement
Expressive TTS
Experiments
Dataset and Experimental Setup
Synthetic Quality on Real Portraits
Synthetic Quality on Multi-Style Virtual Portraits
Results of Decoupled Identity and Emotion Information
Conclusion
Acknowledgments

Figures (6)

Figure 1: Our proposed multi-modal speech synthesis framework, FaceSpeak, which performs expressive and high-quality speech synthesis given image prompts of different styles and the content text (Note: image-speech data from different characters are marked with distinct color codes).
Figure 2: Our image generation pipeline for 1) the EMTTS-MEAD subset (top): we specify the desired output style and use PhotoMaker to transfer the real human image to images of different styles; 2) the EMTTS-ESD-EmovDB subset (bottom): a human expert labels the character factors, ChatGPT turns them into a descriptive text, and DALL-E-3 uses that text to produce images that are highly aligned with the specified parameters.
Figure 3: Block diagram of our proposed FaceSpeak which generates speech given the input text ${\bf t}_i$ and images of different styles (either real ${\bf I}^R_i$ or generated ${\bf I}^G_i$). It consists of two sub-modules: multi-style image feature disentanglement (yellow region) and expressive TTS (gray region).
Figure 4: Visualization of emotion embeddings (colors index emotions).
Figure 5: Visualization of identity embeddings (colors index identities; M: male; F: female).
...and 1 more figure
