DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability

Hyun Joon Park; Jin Sob Kim; Wooseok Shin; Sung Won Han

TL;DR

This work tackles expressive TTS with reference speech by proposing DEX-TTS, a diffusion-based model that learns well-represented style via separate time-invariant (T-IV) and time-variant (T-V) encoders and adapters. It introduces an enhanced DiT diffusion backbone with overlapping patchify and conv-freq embedding to improve latent representations, and uses AdaIN and cross-attention-based adapters to reflect style across time steps. The approach demonstrates superior quality and style similarity on multi-speaker and emotional TTS datasets, including zero-shot scenarios, without requiring pretraining, and its diffusion backbone also improves general TTS on a single-speaker dataset. These results suggest robust, flexible, and scalable expressive TTS suitable for real-world deployment and cross-domain adaptation.
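As a rough illustration of how an AdaIN-style adapter can inject a style into hidden features, the sketch below normalizes each channel over time and then rescales and shifts it. The function name `adain` and the per-channel `scale` and `shift` values are assumptions for illustration; in a real model they would be predicted from the style vector. This is not the authors' implementation.

```python
import math

def adain(features, scale, shift, eps=1e-5):
    """Adaptive instance normalization over time, per channel.

    features: list of channels, each a list of values over time.
    scale, shift: one value per channel (hypothetical stand-ins for
    parameters that a style encoder would produce).
    """
    out = []
    for channel, g, b in zip(features, scale, shift):
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        std = math.sqrt(var + eps)
        # Remove the channel's own statistics, then apply the style's.
        out.append([g * (v - mean) / std + b for v in channel])
    return out

# Toy input: 2 channels, 4 time steps.
features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 12.0, 12.0]]
styled = adain(features, scale=[2.0, 0.5], shift=[1.0, -1.0])
```

After the call, each output channel has a mean close to its `shift` and a standard deviation close to its `scale`, regardless of the statistics of the input channel.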

Abstract

Expressive Text-to-Speech (TTS) using reference speech has been studied extensively to synthesize natural speech, but existing approaches remain limited in obtaining well-represented styles and in model generalization ability. In this study, we present Diffusion-based EXpressive TTS (DEX-TTS), an acoustic model designed for reference-based speech synthesis with enhanced style representations. Built on a general diffusion TTS framework, DEX-TTS includes encoders and adapters to handle styles extracted from reference speech. Key innovations include the separation of styles into time-invariant and time-variant categories for effective style extraction, as well as the design of encoders and adapters with high generalization ability. In addition, we introduce overlapping patchify and convolution-frequency patch embedding strategies to improve DiT-based diffusion networks for TTS. DEX-TTS yields outstanding performance in objective and subjective evaluations on English multi-speaker and emotional multi-speaker datasets, without relying on pre-training strategies. Lastly, comparison results for general TTS on a single-speaker dataset verify the effectiveness of our enhanced diffusion backbone. Demos are available here.
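To make the idea of overlapping patchify concrete, the following minimal sketch splits a sequence of frames into patches using a stride smaller than the patch size, so that neighboring patches share frames. It works along a single (time) axis only, and the function name, patch size, and stride are hypothetical; the abstract does not specify these details, so this is an illustration of the general technique rather than the paper's configuration.

```python
def overlapping_patchify(frames, patch_size, stride):
    """Split a sequence into patches; stride < patch_size yields overlap."""
    if not 0 < stride <= patch_size:
        raise ValueError("stride must satisfy 0 < stride <= patch_size")
    patches = []
    # Slide a window of length patch_size forward by stride each step.
    for start in range(0, len(frames) - patch_size + 1, stride):
        patches.append(frames[start:start + patch_size])
    return patches

# Toy input: 8 frame indices standing in for mel-spectrogram frames.
frames = list(range(8))
non_overlap = overlapping_patchify(frames, patch_size=4, stride=4)
overlap = overlapping_patchify(frames, patch_size=4, stride=2)
```

With a stride equal to the patch size, the 8 frames give 2 disjoint patches; with a stride of 2 they give 3 patches, each sharing two frames with its neighbor, which avoids hard boundaries between patches at the cost of a longer token sequence.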

Paper Structure (46 sections, 10 equations, 6 figures, 13 tables)

Introduction
Related Works
Diffusion-based Text-to-Speech
Expressive Text-to-Speech
DEX-TTS
Preliminaries
Diffusion Formulation
Overall Architecture
Text Encoder
Aligner
Diffusion Decoder
Time-Invariant Style Modeling
T-IV Encoder
T-IV adapter
Time-Variant Style Modeling
...and 31 more sections

Figures (6)

Figure 1: Architecture of DEX-TTS, diffusion decoder, and style encoders and adapters.
Figure 2: Overall architecture of DEX-TTS, Text encoder, and Diffusion decoder.
Figure 3: Overall architecture of GeDEX-TTS.
Figure 4: Style visualizations using T-SNE on the ESD and VCTK datasets. DWC and DBC indicate distance within clusters and distance between clusters. For the ESD dataset, T-SNE is used based on the five emotions of speaker 0016. For the VCTK dataset, T-SNE is applied based on unseen speakers of the dataset. DEX-TTS trained with each dataset is used for style extraction.
Figure 5: Visualization of mel-spectrograms for reference and synthesized speech on the ESD dataset. The orange lines indicate pitch information. Red boxes are used for comparing frequency bins and blue boxes are used for comparing pause points style.
...and 1 more figure
