Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis
Minsu Kim, Pingchuan Ma, Honglie Chen, Stavros Petridis, Maja Pantic
TL;DR
This paper tackles face-driven TTS with controllable speech attributes by introducing RV-TTS, a Transformer-based speech language model that generates a voice conditioned on a face image while adapting pace, tone, noise level, distance, and place via a natural-language text description. It addresses three core challenges: the limited audio quality of audio-visual (AV) speech corpora, applicability to artistic portraits, and the one-to-many nature of face-to-voice mapping. The corresponding remedies are (i) mixing high-quality audio-only data with AV data via alternating voice embeddings, (ii) style augmentation of input faces to bridge real and artistic appearances, and (iii) sampling-based decoding followed by prompting with generated samples to realize diverse yet consistent voices. Key contributions include a shared-embedding pretraining strategy, a style-augmented, cross-attentive RVQ-based speech LM, and a controllable, portrait-capable TTS system that outperforms prior face-driven methods in MOS and face-voice alignment. The approach has implications for voice synthesis of historical figures and artworks, offering flexible, controllable voice generation, though it also invites consideration of ethical use in impersonation and attribution.
Abstract
This paper explores multi-modal controllable Text-to-Speech Synthesis (TTS), where the voice can be generated from a face image and the characteristics of the output speech (e.g., pace, noise level, distance, tone, place) can be controlled with a natural text description. Specifically, we aim to mitigate the following three challenges in face-driven TTS systems. 1) To overcome the limited audio quality of audio-visual speech corpora, we propose a training method that additionally utilizes high-quality audio-only speech corpora. 2) To generate voices not only from real human faces but also from artistic portraits, we propose augmenting the input face image with stylization. 3) To account for the one-to-many possibilities in face-to-voice mapping while ensuring consistent voice generation, we propose to first employ sampling-based decoding and then use prompting with generated speech samples. Experimental results validate the proposed model's effectiveness in face-driven voice synthesis.
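The third challenge describes a two-stage control flow: sample several candidate outputs, then reuse one generated sample as a prompt so later outputs stay conditioned on it. The sketch below illustrates only that control flow, on integer tokens and under strong assumptions. It is not the paper's model: `toy_logits` is a hypothetical stand-in for a speech language model, top-k sampling with temperature is an assumed choice of sampling strategy, and the vocabulary size, `k`, and seeds are arbitrary.

```python
import math
import random


def sample_top_k(logits, k=3, temperature=1.0, rng=random):
    """Draw one token index from the k highest-scoring logits."""
    if k <= 0 or temperature <= 0:
        raise ValueError("k and temperature must be positive")
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits, shifted by the max for stability.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


def toy_logits(context, vocab_size=8):
    """Hypothetical stand-in for a speech LM: scores depend only on the context."""
    seed = sum((pos + 1) * tok for pos, tok in enumerate(context))
    local = random.Random(seed)
    return [local.uniform(-2.0, 2.0) for _ in range(vocab_size)]


def generate(prompt, steps, rng, k=3):
    """Autoregressively extend `prompt` and return only the new tokens."""
    tokens = list(prompt)
    for _ in range(steps):
        tokens.append(sample_top_k(toy_logits(tokens), k=k, rng=rng))
    return tokens[len(prompt):]


# Stage 1: no prompt, so each seed is free to produce a different candidate.
candidates = [generate([], steps=6, rng=random.Random(seed)) for seed in range(3)]

# Stage 2: keep one candidate and prepend it as the prompt for later generations,
# so every new token is conditioned on the chosen sample.
chosen = candidates[0]
continuation = generate(chosen, steps=6, rng=random.Random(42))
```

In this toy setting, the first stage exposes the variety that sampling allows, and the second stage ties later generations to one selected sample. How closely this mirrors the actual decoding and prompting procedure depends on details the abstract does not give.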
