Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis from Face Images with Improved Face-to-Speech Mapping

Minki Kang, Wooseok Han, Eunho Yang

TL;DR

This work tackles zero-shot TTS from face images by separating speaker identity from speech style, using a dedicated face encoder and a prosody encoder. Face-StyleSpeech represents prosody as discrete codes obtained through vector quantization, and a Prosody Language Model generates these codes from text, so the face representation no longer has to capture all of the speech style. The approach yields more natural, intelligible speech with better voice-face alignment for unseen faces, and shows improved diversity and consistency across different face frames. Overall, the method advances face-to-voice mapping by using prosody disentanglement to produce realistic, face-consistent speech without reference audio.
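As an illustration of what "discrete prosody codes via vector quantization" means in general, the sketch below maps each continuous feature vector to the index of its nearest codebook entry. It is a generic nearest-neighbor quantizer, not the paper's implementation: the function names, the 2-dimensional toy codebook, and the toy frames are all assumptions made for this example.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each frame, the index of the closest codebook vector."""
    return [
        min(range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


# Hypothetical toy codebook with three entries (a learned one would be far larger).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
# Hypothetical continuous prosody features, one vector per frame.
frames = [[0.9, 1.2], [0.1, -0.2], [-0.8, 0.7]]

codes = quantize(frames, codebook)
print(codes)
```

The resulting integer sequence is the kind of discrete token stream that a language model can be trained to predict from text.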

Abstract

Generating speech from a face image is crucial for developing virtual humans capable of interacting using their unique voices, without relying on pre-recorded human speech. In this paper, we propose Face-StyleSpeech, a zero-shot Text-To-Speech (TTS) synthesis model that generates natural speech conditioned on a face image rather than reference speech. We hypothesize that learning entire prosodic features from a face image poses a significant challenge. To address this, our TTS model incorporates both face and prosody encoders. The prosody encoder is specifically designed to model speech style characteristics that are not fully captured by the face image, allowing the face encoder to focus on extracting speaker-specific features such as timbre. Experimental results demonstrate that Face-StyleSpeech effectively generates more natural speech from a face image than baselines, even for unseen faces. Samples are available on our demo page.

Paper Structure

This paper contains 11 sections, 4 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Concept. Previous methods model both speaker identity and speech style from the face image, which is highly challenging. Instead, we model the speech style with prosody codes, so that the face-to-voice mapping covers only speaker-related features and becomes more stable.
  • Figure 2: Overview of Face-StyleSpeech. (1) The TTS model generates speech given the text embedding, prosody codes, and the speech vector. (2) We train the face encoder to generate a face vector that matches the paired speech vector. (3) At inference, we use the face vector together with prosody codes generated by the prosody language model.
  • Figure 3: Preference Test. Results of (Left) Face-based Voice Preference and (Right) Voice-based Face Preference Tests.
  • Figure 4: Analysis of the Effects of Different Frames. We visualize the mel-spectrograms of speech synthesized from the face images. On the right, we plot the speaker embedding cosine similarity (SECS) between the speech samples and the speech from the target face.
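
For readers unfamiliar with the SECS metric mentioned in Figure 4, the following minimal sketch shows how a cosine similarity between speaker embeddings is computed. The embeddings here are hypothetical 3-dimensional toy vectors and the frame names are made up; in practice the embeddings would come from a speaker verification model, which this summary does not specify.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical embedding of the speech synthesized from the target face.
target_embedding = [1.0, 0.0, 1.0]
# Hypothetical embeddings of speech synthesized from two other frames.
frame_embeddings: Dict[str, Sequence[float]] = {
    "frame_a": [2.0, 0.0, 2.0],  # same direction as the target
    "frame_b": [0.0, 1.0, 0.0],  # orthogonal to the target
}

secs_per_frame = {
    name: cosine_similarity(embedding, target_embedding)
    for name, embedding in frame_embeddings.items()
}
print(secs_per_frame)
```

A value close to 1 means the two speech samples have very similar speaker embeddings, while a value near 0 means the embeddings are unrelated.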