Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis

Minsu Kim, Pingchuan Ma, Honglie Chen, Stavros Petridis, Maja Pantic

TL;DR

This paper tackles face-driven TTS with controllable speech attributes by introducing RV-TTS, a Transformer-based speech language model that generates a voice conditioned on a face image while adapting pace, tone, noise level, distance, and place via a natural-language description. It addresses three core challenges: the limited audio quality of audio-visual (AV) speech corpora, applicability to artistic portraits, and the one-to-many nature of face-to-voice mapping. The corresponding remedies are (i) mixing high-quality audio-only data with AV data by alternating between face-driven and audio-driven voice embeddings, (ii) style augmentation of input faces to bridge real and artistic appearances, and (iii) sampling-based decoding followed by prompting to realize diverse yet consistent voices. Key contributions include a shared-embedding pretraining strategy, a style-augmented, cross-attentive RVQ-based speech LM, and a controllable, portrait-capable TTS system that outperforms prior face-driven methods in MOS and face-voice alignment, while enabling natural language-based control of speech attributes. The approach has implications for voice synthesis of historical figures and artworks, though it also raises ethical questions around impersonation and attribution.

Abstract

This paper explores multi-modal controllable Text-to-Speech Synthesis (TTS), where the voice is generated from a face image and the characteristics of the output speech (e.g., pace, noise level, distance, tone, place) are controlled with a natural-language text description. Specifically, we aim to mitigate the following three challenges in face-driven TTS systems. 1) To overcome the limited audio quality of audio-visual speech corpora, we propose a training method that additionally utilizes high-quality audio-only speech corpora. 2) To generate voices not only from real human faces but also from artistic portraits, we propose augmenting the input face image with stylization. 3) To account for the one-to-many nature of face-to-voice mapping while still ensuring consistent voice generation, we propose to first employ sampling-based decoding and then use prompting with the generated speech samples. Experimental results validate the proposed model's effectiveness in face-driven voice synthesis.
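
The third point contrasts sampling-based decoding with deterministic decoding. The sketch below is a toy illustration only, not the paper's decoder: the per-step logits in `TOY_LOGITS`, the temperature value, and all function names are hypothetical stand-ins for the scores a speech language model would assign to candidate speech tokens. It shows why sampling can yield several plausible outputs for the same input, whereas greedy decoding always yields one.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Draw one token index from softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def sample_decode(step_logits, temperature=1.0, seed=None):
    """Sample one token per step; different seeds may give different sequences."""
    rng = random.Random(seed)
    return [sample_token(logits, temperature, rng) for logits in step_logits]

def greedy_decode(step_logits):
    """Pick the highest-scoring token at every step (always the same output)."""
    return [max(range(len(logits)), key=logits.__getitem__) for logits in step_logits]

# Hypothetical logits for 4 decoding steps over a 3-token vocabulary.
TOY_LOGITS = [
    [2.0, 1.5, 0.1],
    [0.3, 0.2, 0.1],
    [0.0, 0.0, 0.0],
    [1.0, 0.9, 0.8],
]

# Several sampled candidates for the same input (the one-to-many side).
candidates = [sample_decode(TOY_LOGITS, temperature=1.0, seed=s) for s in range(5)]
```

In the setting described above, one of the sampled outputs would then be reused as a prompt so that later utterances keep the same voice; that prompting step depends on the actual model and is not reproduced in this toy example.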

Paper Structure

This paper contains 16 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of the proposed RV-TTS: The face image controls the voice, the descriptive text controls speech characteristics, and the input text determines the content of speech.
  • Figure 2: Illustration of the proposed RV-TTS. (a) The face image is randomly stylized using a pre-trained style transfer model to reduce the gap between real human faces and artistic portraits. (b) The face encoder and audio encoder are pre-trained through contrastive learning to share a common embedding space. (c) During training, the model alternates between face-driven and audio-driven voice embeddings to learn not only to associate face images with voices but also to synthesize high-quality audio.
  • Figure 3: Comparison of speaker identification test results.
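
Panels (b) and (c) of Figure 2 can be illustrated with a small sketch. The caption only says the encoders are pre-trained "through contrastive learning"; the symmetric InfoNCE-style loss, the temperature of 0.07, the switching probability `p_face`, and every function name below are assumptions made for illustration, not details confirmed by the paper.

```python
import math
import random

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]

def symmetric_contrastive_loss(face_embs, audio_embs, temperature=0.07):
    """Assumed InfoNCE-style loss: the i-th face should match the i-th audio."""
    faces = [l2_normalize(v) for v in face_embs]
    audios = [l2_normalize(v) for v in audio_embs]
    n = len(faces)
    # Cosine similarity between every face and every audio, scaled by temperature.
    sim = [[sum(x * y for x, y in zip(f, a)) / temperature for a in audios]
           for f in faces]
    face_to_audio = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    audio_to_face = sum(
        cross_entropy_row([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (face_to_audio + audio_to_face)

def pick_voice_embedding(face_emb, audio_emb, p_face=0.5, rng=random):
    """Choose the voice-embedding source for one training sample.

    Audio-only samples (face_emb is None) always use the audio-driven
    embedding; audio-visual samples use the face-driven one with
    probability p_face (a hypothetical hyperparameter).
    """
    if face_emb is None or rng.random() >= p_face:
        return "audio", audio_emb
    return "face", face_emb
```

Because both encoders map into one embedding space, either source can be fed to the same synthesizer, which is what would let high-quality audio-only corpora contribute to training alongside audio-visual data.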