Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling

Quanxiu Wang, Hui Huang, Mingjie Wang, Yong Dai, Jinzuomu Zhong, Benlai Tang

TL;DR

The paper addresses a bottleneck in TTS frontend modeling, namely scarce annotated text and reliance on text-only signals, by learning cross-modal representations from text-audio pairs without heavy supervision. It introduces TAP-FM, a two-stage framework. In Stage I, MC-TAP performs prior-agnostic multi-scale contrastive pre-training at the span level and the sentence level, augmented with a Masked Language Modeling (MLM) objective. The overall pre-training loss is $\mathcal{L} = \mathcal{L}_{span} + \alpha \mathcal{L}_{sen} + \beta \mathcal{L}_{mlm}$, where $\mathcal{L}_{span} = \frac{\mathcal{L}_{T2A} + \mathcal{L}_{A2T}}{2}$ averages the two directional contrastive terms. Stage II deploys a parallelized frontend that predicts TN, PBP, and PD using the TextEncoder (TE) shared from MC-TAP and a ResConformer backbone, with a modified Dynamic Weight Averaging scheme (DWA+) balancing the tasks. Experiments on LibriSpeech, GigaSpeech, and LJSpeech are reported as state-of-the-art, particularly for PBP, and ablations indicate that span-level alignment, MLM, and the TE from MC-TAP each matter for downstream TTS frontend modeling.
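
To make the loss composition above concrete, here is a minimal sketch in plain Python. Only the two formulas, $\mathcal{L} = \mathcal{L}_{span} + \alpha \mathcal{L}_{sen} + \beta \mathcal{L}_{mlm}$ and $\mathcal{L}_{span} = \frac{\mathcal{L}_{T2A} + \mathcal{L}_{A2T}}{2}$, come from the summary. The InfoNCE-style form of each directional term, the temperature value, the default weights, and all function names are assumptions made for illustration and may differ from what the paper actually uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(sim, temperature):
    """Mean cross-entropy where the positive for row i is column i.

    Assumed form of one directional term; the summary does not spell it out.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(sim)


def span_loss(text_vecs, audio_vecs, temperature=0.07):
    """L_span = (L_T2A + L_A2T) / 2 over paired span embeddings."""
    sim = [[cosine(t, a) for a in audio_vecs] for t in text_vecs]
    sim_transposed = [list(col) for col in zip(*sim)]
    l_t2a = info_nce(sim, temperature)             # rows are text spans
    l_a2t = info_nce(sim_transposed, temperature)  # rows are audio spans
    return (l_t2a + l_a2t) / 2


def total_loss(l_span, l_sen, l_mlm, alpha=1.0, beta=1.0):
    """L = L_span + alpha * L_sen + beta * L_mlm (weights are placeholders)."""
    return l_span + alpha * l_sen + beta * l_mlm


# Toy embeddings (made up): correctly paired spans vs. a shuffled pairing.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned = span_loss(text, [[1.0, 0.0], [0.0, 1.0]])
shuffled = span_loss(text, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the correctly paired embeddings yield a much smaller span loss than the shuffled pairing, which is the behavior a contrastive objective is meant to reward.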

Abstract

Over the past decade, sustained effort has been dedicated to developing highly expressive and controllable text-to-speech (TTS) systems. In general, a complete TTS system comprises two interconnected components: the frontend module and the backend module. The frontend captures linguistic representations from the raw text input, while the backend converts these linguistic cues to speech. The research community has shown growing interest in the frontend component, recognizing its pivotal role in text-to-speech systems; its tasks include Text Normalization (TN), Prosody Boundary Prediction (PBP), and Polyphone Disambiguation (PD). Nonetheless, insufficient annotated textual data and the reliance on homogeneous text signals significantly undermine the effectiveness of supervised learning for the frontend. To overcome this obstacle, a novel two-stage TTS frontend prediction pipeline, named TAP-FM, is proposed in this paper. Specifically, during the first learning phase, we present a Multi-scale Contrastive Text-audio Pre-training protocol (MC-TAP), which aims to acquire richer insights via multi-granularity contrastive pre-training in an unsupervised manner. Instead of mining homogeneous features as in prior pre-training approaches, our framework demonstrates the ability to delve deep into both global and local text-audio semantic and acoustic representations. Furthermore, in the second stage, a parallelized TTS frontend model is devised to perform the TN, PD, and PBP prediction tasks. Finally, extensive experiments illustrate the superiority of our proposed method, which achieves state-of-the-art performance.

Paper Structure

This paper contains 18 sections, 11 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: Overview of the proposed system. Stage I: The prior-agnostic Multi-scale Contrastive Text-audio Pre-training (MC-TAP). Stage II: Our parallelized TTS frontend model for predicting TN, PBP, and PD tasks. ResultMerge is a rule-based approach designed to address prosody boundary and polyphone issues within non-standard words (NSWs).
  • Figure 2: The cosine similarity matrix between word vectors and corresponding audio segment features from a real case. Darker colors represent higher similarity.
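
As a rough illustration of the kind of matrix Figure 2 visualizes, the sketch below computes cosine similarities between word vectors and audio segment features in plain Python. The embeddings are made-up toy values, not data from the paper, and the function name `cosine_similarity_matrix` is hypothetical.

```python
import math


def cosine_similarity_matrix(word_vecs, segment_vecs):
    """Return M where M[i][j] is the cosine similarity of word i and segment j.

    Assumes every vector is non-zero; a zero vector would divide by zero.
    """
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    words = [unit(v) for v in word_vecs]
    segments = [unit(v) for v in segment_vecs]
    return [[sum(a * b for a, b in zip(w, s)) for s in segments] for w in words]


# Toy, made-up embeddings: three words and three audio segments.
word_vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
segment_vecs = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.2, 0.7]]

matrix = cosine_similarity_matrix(word_vecs, segment_vecs)

# For each word, the index of the most similar audio segment.
best_segment = [max(range(len(row)), key=row.__getitem__) for row in matrix]
```

When word and segment embeddings are well aligned, the largest values concentrate on the diagonal, which would appear as the darkest cells in a plot like the one the caption describes.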