Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining

Jinlong Xue; Yayue Deng; Yingming Gao; Ya Li

TL;DR

The paper addresses the sensitivity of zero-shot, prompt-based TTS to the chosen speech prompt. It proposes a retrieval-augmented framework in which CA-CLAP builds a context-aware shared embedding space for multi-modal retrieval and a GPT-SoVITS-based generator synthesizes the speech. Key contributions are the CA-CLAP model with cross-attention and a two-stage TTS pipeline, validated by objective and subjective evaluations that show improvements over text-only and random-prompt baselines; retrieval performance peaks at context length $l=5$. The approach improves voice-cloning control and naturalness in context-rich TTS scenarios, suggesting practical utility for audiobook and conversational synthesis.

Abstract

Recent prompt-based text-to-speech (TTS) models can clone an unseen speaker using only a short speech prompt. They leverage a strong in-context ability to mimic the speech prompts, including speaker style, prosody, and emotion. Therefore, the selection of a speech prompt greatly influences the generated speech, akin to the importance of a prompt in large language models (LLMs). However, current prompt-based TTS models choose the speech prompt manually or simply at random. Hence, in this paper, we adapt retrieval augmented generation (RAG) from LLMs to prompt-based TTS. Unlike traditional RAG methods, we additionally consider contextual information during the retrieval process and present a Context-Aware Contrastive Language-Audio Pre-training (CA-CLAP) model to extract context-aware, style-related features. The objective and subjective evaluations demonstrate that our proposed RAG method outperforms baselines, and our CA-CLAP achieves better results than text-only retrieval methods.
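The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that the query (the target text with its context) and every candidate speech prompt have already been mapped into one shared embedding space, and that candidates are ranked by cosine similarity. The function names `cosine` and `retrieve_prompts`, the toy vectors, and the choice of cosine similarity are illustrative assumptions, not the authors' implementation; the CA-CLAP encoders themselves are not modeled here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve_prompts(query_emb, prompt_embs, k=1):
    """Return the indices of the k candidate prompts most similar to the query.

    query_emb and prompt_embs are hypothetical stand-ins for embeddings
    produced by a context-aware encoder such as CA-CLAP.
    """
    ranked = sorted(
        range(len(prompt_embs)),
        key=lambda i: cosine(query_emb, prompt_embs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings (assumed values, for illustration only).
query = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # closest to the query
    [0.5, 0.5, 0.0],  # moderately close
]
top = retrieve_prompts(query, candidates, k=2)
```

In such a setup, the selected indices would identify the speech prompts passed to the prompt-based TTS generator in place of a manually or randomly chosen prompt.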

Paper Structure (14 sections, 4 equations, 2 figures, 3 tables)

Introduction
Methodology
RAG-enhanced Prompt-based TTS
Context-Aware Contrastive Language-audio Pretraining (CA-CLAP)
Prompt-based Text-to-Speech
Experiments
Training Setup
Compared Methods
Objective Evaluation
Subjective Evaluation
Effects of Context Length
Effects of Speech Prompt Number
Conclusion
Acknowledgements

Figures (2)

Figure 1: An overview of our RAG-enhanced prompt-based TTS
Figure 2: Left: An Overview of Context-Aware CLAP. Right: An illustration of the inference process of Prompt-based TTS
