DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability

Hyun Joon Park, Jin Sob Kim, Wooseok Shin, Sung Won Han

TL;DR

This work tackles reference-based expressive TTS by proposing DEX-TTS, a diffusion-based model that learns well-represented styles via separate time-invariant (T-IV) and time-variant (T-V) encoders and adapters. It introduces an enhanced DiT diffusion backbone with overlapping patchify and conv-freq embedding to improve latent representations, and uses AdaIN-based and cross-attention-based adapters to reflect style across time steps. The approach demonstrates superior quality and style similarity on multi-speaker and emotional TTS datasets, including zero-shot scenarios, without requiring pre-training, and its diffusion backbone also improves general TTS on a single-speaker dataset. These results suggest that the approach is robust and flexible enough for expressive TTS in real-world deployment and cross-domain adaptation.
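As a rough illustration of the AdaIN idea mentioned above, the following minimal sketch normalizes each channel of a feature map over time and then rescales it with a per-channel scale and shift. The function name `adain` and the inputs `gamma` and `beta` are hypothetical; how DEX-TTS derives them from the style encoders is not described in this summary, so they are simply passed in.

```python
import math
from statistics import fmean, pvariance


def adain(content, gamma, beta, eps=1e-5):
    """Adaptive instance normalization over the time axis.

    content: list of channels, each a list of per-frame values.
    gamma, beta: per-channel scale and shift (assumed here to come from
    a style vector; this mapping is an assumption, not the paper's spec).
    """
    out = []
    for channel, g, b in zip(content, gamma, beta):
        mu = fmean(channel)
        var = pvariance(channel, mu)
        denom = math.sqrt(var + eps)
        # Remove the channel's own statistics, then impose the style's.
        out.append([g * (x - mu) / denom + b for x in channel])
    return out


features = [[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]]
styled = adain(features, gamma=[2.0, 1.0], beta=[0.5, -1.0])
```

After the call, each output channel has a mean equal to its `beta` and a standard deviation close to its `gamma`, regardless of the statistics of the input channel.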

Abstract

Expressive Text-to-Speech (TTS) using reference speech has been studied extensively to synthesize natural speech, but existing approaches remain limited in obtaining well-represented styles and in improving model generalization ability. In this study, we present Diffusion-based EXpressive TTS (DEX-TTS), an acoustic model designed for reference-based speech synthesis with enhanced style representations. Based on a general diffusion TTS framework, DEX-TTS includes encoders and adapters to handle styles extracted from reference speech. Key innovations include the differentiation of styles into time-invariant and time-variant categories for effective style extraction, as well as the design of encoders and adapters with high generalization ability. In addition, we introduce overlapping patchify and convolution-frequency patch embedding strategies to improve DiT-based diffusion networks for TTS. DEX-TTS yields outstanding performance in both objective and subjective evaluations on English multi-speaker and emotional multi-speaker datasets, without relying on pre-training strategies. Lastly, the comparison results for general TTS on a single-speaker dataset verify the effectiveness of our enhanced diffusion backbone. Demos are available here.
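To make the overlapping patchify idea concrete, here is a minimal sketch on a small 2D grid standing in for a frequency-by-time representation. Standard DiT-style patchify uses a stride equal to the patch size, so patches do not overlap; choosing a smaller stride makes neighbouring patches share values. The function name, the patch size, and the stride are illustrative assumptions, and the settings used in DEX-TTS are not given in this summary.

```python
def overlapping_patchify(grid, patch, stride):
    """Split a 2D grid (rows x cols) into flattened square patches.

    patch: side length of each patch.
    stride: step between patch origins; stride < patch gives overlap,
    stride == patch gives the usual non-overlapping patchify.
    """
    rows, cols = len(grid), len(grid[0])
    patches = []
    for r in range(0, rows - patch + 1, stride):
        for c in range(0, cols - patch + 1, stride):
            patches.append(
                [grid[r + i][c + j] for i in range(patch) for j in range(patch)]
            )
    return patches


# Toy 4 x 6 grid whose entries encode their own position.
grid = [[r * 6 + c for c in range(6)] for r in range(4)]
plain = overlapping_patchify(grid, patch=2, stride=2)
overlapped = overlapping_patchify(grid, patch=2, stride=1)
```

With stride 2 the toy grid yields 6 disjoint patches, while stride 1 yields 15 patches in which adjacent patches share a column or row of values, so each token sees some of its neighbours' context.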

Paper Structure (46 sections, 10 equations, 6 figures, 13 tables)

Figures (6)

  • Figure 1: Architecture of DEX-TTS, diffusion decoder, and style encoders and adapters.
  • Figure 2: Overall architecture of DEX-TTS, Text encoder, and Diffusion decoder.
  • Figure 3: Overall architecture of GeDEX-TTS.
  • Figure 4: Style visualizations using t-SNE on the ESD and VCTK datasets. DWC and DBC indicate distance within clusters and distance between clusters, respectively. For the ESD dataset, t-SNE is applied to the five emotions of speaker 0016. For the VCTK dataset, t-SNE is applied to the unseen speakers of the dataset. DEX-TTS trained on each dataset is used for style extraction.
  • Figure 5: Visualization of mel-spectrograms for reference and synthesized speech on the ESD dataset. The orange lines indicate pitch information. Red boxes are used for comparing frequency bins, and blue boxes are used for comparing the style of pause points.
  • ...and 1 more figure
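The Figure 4 caption refers to DWC (distance within clusters) and DBC (distance between clusters) without giving formulas. One plausible reading, sketched below purely as an assumption, takes DWC as the mean distance from each embedding to its own cluster centroid and DBC as the mean distance between pairs of centroids. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def cluster_distances(clusters):
    """Return (dwc, dbc) for a dict mapping label -> list of points.

    Assumed definitions (not confirmed by the source):
    dwc = mean distance from each point to its cluster centroid,
    dbc = mean distance over all pairs of cluster centroids.
    Requires at least two clusters.
    """
    centers = {label: centroid(pts) for label, pts in clusters.items()}
    within = [
        math.dist(p, centers[label])
        for label, pts in clusters.items()
        for p in pts
    ]
    labels = list(centers)
    between = [
        math.dist(centers[a], centers[b])
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
    ]
    return sum(within) / len(within), sum(between) / len(between)


# Toy 2D style embeddings for two emotion labels.
toy = {
    "happy": [[0.0, 0.0], [0.0, 2.0]],
    "sad": [[10.0, 0.0], [10.0, 2.0]],
}
dwc, dbc = cluster_distances(toy)
```

Under these assumed definitions, a small DWC together with a large DBC indicates that styles of the same class sit close together while different classes are well separated.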