Combining Masked Language Modeling and Cross-Modal Contrastive Learning for Prosody-Aware TTS

Kirill Borodin, Vasiliy Kudryavtsev, Maxim Maslov, Nikita Vasiliev, Mikhail Gorodnichev, Grach Mkrtchian

Abstract

We investigate multi-stage pretraining for prosody modeling in diffusion-based TTS. A speaker-conditioned dual-stream encoder is trained with masked language modeling followed by SigLIP-style cross-modal contrastive learning using mixed-phoneme batches, with an additional same-phoneme refinement stage studied separately. We evaluate intrinsic text-audio retrieval and downstream synthesis in Grad-TTS and a latent diffusion TTS system. The two-stage curriculum (MLM + mixed-phoneme contrastive learning) achieves the best overall synthesis quality in terms of intelligibility, speaker similarity, and perceptual measures. Although same-phoneme refinement improves prosodic retrieval, it reduces phoneme discrimination and degrades synthesis. These findings indicate that improvements in embedding-space metrics do not necessarily translate to better generative performance and highlight the need to balance phoneme discrimination and prosodic sensitivity in TTS pretraining.
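The abstract's second stage uses a SigLIP-style objective, which replaces the usual softmax contrastive loss with an independent sigmoid loss per text-audio pair. Below is a minimal sketch of that loss over a batch of paired embeddings, assuming one pooled vector per utterance on each side; the names `text_emb`, `audio_emb`, and the learnable log-temperature `log_t` and bias `b` follow the SigLIP formulation and are illustrative, not the paper's exact code.

```python
import torch
import torch.nn.functional as F

def siglip_loss(text_emb, audio_emb, log_t, b):
    """SigLIP-style pairwise sigmoid contrastive loss.

    text_emb, audio_emb: (N, D) embeddings of N paired text/audio items.
    log_t, b: learnable scalar log-temperature and bias.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = text_emb @ audio_emb.t() * log_t.exp() + b           # (N, N) similarity matrix
    # +1 on the diagonal (matched pairs), -1 everywhere else (mismatched pairs).
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # Each text-audio pair is scored as an independent binary classification.
    return -F.logsigmoid(labels * logits).mean()
```

Because the sigmoid loss needs no batch-wide softmax normalization, it is less sensitive to batch composition, which plausibly matters for the mixed-phoneme batches the paper describes.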

Paper Structure

This paper contains 19 sections, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: Prosody encoder architecture. Phoneme and BPE streams are independently embedded and processed by speaker-conditioned transformer encoders. BPE hidden states are aggregated via word-level pooling and expanded to phoneme resolution. The fused streams pass through a shared encoder, layer normalization, and a convolutional projection to produce per-phoneme prosodic embeddings.
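The caption specifies the encoder's data flow precisely enough to sketch. The following PyTorch module is a minimal illustration of that flow, not the authors' implementation: all hyperparameters (dimensions, layer counts, head counts), fusion by summation, and additive speaker conditioning are assumptions, as are the precomputed word-index tensors `bpe_word_id` and `phone_word_id` used for pooling and expansion.

```python
import torch
import torch.nn as nn

class ProsodyEncoder(nn.Module):
    """Dual-stream prosody encoder sketched from the Figure 1 caption."""

    def __init__(self, n_phonemes, n_bpe, n_speakers, d=256, n_layers=4):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phonemes, d)
        self.bpe_emb = nn.Embedding(n_bpe, d)
        self.spk_emb = nn.Embedding(n_speakers, d)  # speaker conditioning (additive here, an assumption)
        layer = lambda: nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.phone_enc = nn.TransformerEncoder(layer(), n_layers)   # phoneme stream
        self.bpe_enc = nn.TransformerEncoder(layer(), n_layers)     # BPE stream
        self.shared_enc = nn.TransformerEncoder(layer(), n_layers)  # shared encoder after fusion
        self.norm = nn.LayerNorm(d)
        self.proj = nn.Conv1d(d, d, kernel_size=3, padding=1)       # convolutional projection

    def forward(self, phonemes, bpe_tokens, speaker, bpe_word_id, phone_word_id):
        # phonemes: (B, Tp), bpe_tokens: (B, Tb), speaker: (B,)
        # bpe_word_id / phone_word_id: (B, Tb) / (B, Tp) word index per token.
        spk = self.spk_emb(speaker).unsqueeze(1)                    # (B, 1, d)
        h_p = self.phone_enc(self.phone_emb(phonemes) + spk)        # (B, Tp, d)
        h_b = self.bpe_enc(self.bpe_emb(bpe_tokens) + spk)          # (B, Tb, d)
        # Word-level mean pooling of BPE states ...
        n_words = int(bpe_word_id.max()) + 1
        pooled = torch.zeros(h_b.size(0), n_words, h_b.size(-1), device=h_b.device)
        counts = torch.zeros(h_b.size(0), n_words, 1, device=h_b.device)
        pooled.scatter_add_(1, bpe_word_id.unsqueeze(-1).expand_as(h_b), h_b)
        counts.scatter_add_(1, bpe_word_id.unsqueeze(-1), torch.ones_like(h_b[..., :1]))
        word_states = pooled / counts.clamp(min=1)                  # (B, W, d)
        # ... then expansion back to phoneme resolution via each phoneme's word index.
        h_b_up = torch.gather(
            word_states, 1,
            phone_word_id.unsqueeze(-1).expand(-1, -1, h_p.size(-1)))
        fused = self.shared_enc(h_p + h_b_up)                       # stream fusion by sum (assumption)
        out = self.proj(self.norm(fused).transpose(1, 2)).transpose(1, 2)
        return out                                                  # (B, Tp, d) per-phoneme prosodic embeddings
```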