Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding

Tianyun Liu

TL;DR

Clip-TTS tackles the lack of semantic context in traditional phoneme-to-Mel mapping by employing Clip-style contrastive learning to align text content with real Mel spectrograms. It introduces Speech-Clip as a multimodal foundation and extends it into Clip-TTS by adding a Mel decoder and vocoder for end-to-end synthesis while preserving fast Transformer-based inference. Experimental results across LJSpeech, Baker, AISHELL3, LibriTTS, and multi-emotion datasets show strong MOS performance, with state-of-the-art results on Baker and robust expressiveness in multilingual and emotional settings. The work also outlines promising avenues for Speech-Clip in downstream tasks like speech recognition, translation, and enhancement, and discusses future directions toward zero-shot TTS and larger-scale models.

Abstract

Traditional text-to-speech (TTS) methods primarily focus on establishing a mapping between phonemes and mel-spectrograms. However, the phoneme encoding stage often has no access to auxiliary information from real mel-spectrograms, so the encoding process lacks true semantic understanding. At the same time, traditional TTS systems often struggle to balance inference speed against the quality of the synthesized speech: methods that generate high-quality speech tend to be slower at inference, while faster methods often sacrifice speech quality. In this paper, I propose Clip-TTS, a TTS method based on the Clip architecture. This method uses the Clip framework to establish a connection between text content and real mel-spectrograms during the text encoding stage, enabling the text encoder to directly learn the true semantics of the global context and thereby supporting the quality of the synthesized speech. In terms of model architecture, I adopt the basic structure of the Transformer, which allows Clip-TTS to achieve fast inference speeds. Experimental results show that on the LJSpeech and Baker datasets, the speech generated by Clip-TTS achieves state-of-the-art MOS scores, and it also performs excellently on multi-emotion datasets. Audio samples are available at: https://ltydd1314.github.io/.
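
The abstract describes aligning text content with real mel-spectrograms in the Clip style but does not spell out the objective here. As a rough illustration only, the sketch below shows the standard symmetric contrastive (InfoNCE) loss used by the original CLIP, applied to a batch of paired text and mel embeddings. The function names, the embedding shapes, and the temperature value are assumptions for this example and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale each embedding to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(text_emb, mel_emb, temperature=0.07):
    """Symmetric CLIP-style loss for a batch of paired embeddings.

    text_emb, mel_emb: arrays of shape (batch, dim), where row i of each
    array comes from the same utterance (hypothetical encoder outputs).
    temperature: assumed fixed value; CLIP itself learns this parameter.
    """
    text_emb = l2_normalize(text_emb)
    mel_emb = l2_normalize(mel_emb)

    # Pairwise similarity matrix: entry (i, j) compares text i with mel j.
    logits = text_emb @ mel_emb.T / temperature

    # The correct pairing lies on the diagonal.
    idx = np.arange(logits.shape[0])
    loss_text_to_mel = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_mel_to_text = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_text_to_mel + loss_mel_to_text)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(4, 16))
    # Perfectly aligned pairs should score a lower loss than shuffled pairs.
    print("aligned: ", clip_contrastive_loss(emb, emb))
    print("shuffled:", clip_contrastive_loss(emb, emb[::-1]))
```

Under this kind of objective, the text encoder is pulled toward the embedding of its own utterance's mel-spectrogram and pushed away from the others in the batch, which is one plausible reading of how the text encoder "learns the true semantics" from real audio.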

Paper Structure

This paper contains 7 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The general framework of Speech-Clip
  • Figure 2: The general framework of Text Encoder
  • Figure 3: The general framework of Clip-TTS
  • Figure 4: The general framework of Clip-TTS 2
  • Figure 5: The downstream tasks that Speech-Clip may support