Pheme: Efficient and Conversational Speech Generation

Paweł Budzianowski, Taras Sereda, Tomasz Cichy, Ivan Vulić

TL;DR

Pheme targets efficient, real-time, conversational text-to-speech (TTS) by coupling a compact T2S component with a non-autoregressive A2S decoder that uses MaskGIT-style iterative sampling over speech tokens produced by residual vector quantization (RVQ). The system uses SpeechTokenizer for joint semantic-acoustic tokenization and augmented speaker embeddings to enable one-shot and zero-shot voice generation. It is trained on smaller, noisy conversational data (GigaSpeech), with optional distillation from larger teacher models. Empirical results show substantial speedups over autoregressive baselines with competitive intelligibility, naturalness, and speaker fidelity, and single-speaker specialization is feasible via distillation on synthetic data. The work offers a practical route to production-ready conversational TTS and releases open resources for further work on efficient, high-quality speech synthesis.
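The MaskGIT-style iterative sampling mentioned above can be illustrated with a minimal sketch. It is not the Pheme implementation: `toy_predictor` is a hypothetical stand-in for the A2S network that returns random guesses, the decoder works on a single token sequence rather than multiple RVQ levels, and the cosine schedule is the one commonly associated with MaskGIT, assumed here for illustration. The point is only the control flow: all positions start masked, every step predicts all positions in parallel, and only the most confident predictions are committed.

```python
import math
import random

MASK = -1  # sentinel for a position that has not been decoded yet


def toy_predictor(tokens, vocab_size, rng):
    """Hypothetical stand-in for a trained model.

    Returns one (token, confidence) guess per position. A real model would
    condition on the already committed tokens; this one guesses at random.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def maskgit_decode(length, vocab_size, steps=8, seed=0, predictor=toy_predictor):
    """Fill `length` token slots in at most `steps` parallel refinement steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        guesses = predictor(tokens, vocab_size, rng)
        # Cosine schedule: how many positions may remain masked after this step.
        keep_masked = math.floor(length * math.cos(math.pi / 2 * (step + 1) / steps))
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Commit the most confident guesses, always at least one per step.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        n_commit = max(1, len(masked) - keep_masked)
        for i in masked[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens


if __name__ == "__main__":
    print(maskgit_decode(length=16, vocab_size=1024))
```

Because the number of steps is fixed and small, the cost of decoding does not grow with one forward pass per token, which is the source of the speedup over autoregressive decoding that the summary refers to.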

Abstract

In recent years, speech generation has seen remarkable progress, now achieving one-shot generation capability that is often virtually indistinguishable from real human voice. Integrating such advancements in speech generation with large language models might revolutionize a wide range of applications. However, certain applications, such as assistive conversational systems, require natural and conversational speech generation tools that also operate efficiently in real time. Current state-of-the-art models like VALL-E and SoundStorm, powered by hierarchical neural audio codecs, require large neural components and extensive training data to work well. In contrast, MQTTS aims to build more compact conversational TTS models while capitalizing on smaller-scale real-life conversational speech data. However, its autoregressive nature yields high inference latency and thus limits its real-time usage. To mitigate the current limitations of state-of-the-art TTS models while capitalizing on their strengths, in this work we introduce the Pheme model series, which 1) offers compact yet high-performing models, 2) generates speech in parallel, 3) produces natural conversational speech, and 4) can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x while still matching the quality of autoregressive TTS models. We also show that simple teacher-student distillation on top of pretrained Pheme checkpoints yields significant improvements in voice quality for single-speaker setups, relying solely on synthetic speech generated by much larger teacher models. Audio samples and pretrained models are available online.

Paper Structure (11 sections, 5 tables)