DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability
Hyun Joon Park, Jin Sob Kim, Wooseok Shin, Sung Won Han
TL;DR
This work tackles reference-based expressive TTS by proposing DEX-TTS, a diffusion-based model that separates style into time-invariant (T-IV) and time-variant (T-V) components, each handled by its own encoder and adapter. It introduces an enhanced DiT diffusion backbone with overlapping patchify and conv-freq embedding to improve latent representations, and uses AdaIN and cross-attention-based adapters to reflect style across time steps. The approach achieves higher quality and style similarity on multi-speaker and emotional TTS datasets, including zero-shot scenarios, without requiring pre-training, and its diffusion backbone also improves general TTS on a single-speaker dataset. These results suggest that the approach generalizes well across speakers, emotions, and TTS settings.
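As a point of reference for the AdaIN adapter mentioned above, the following is a minimal NumPy sketch of adaptive instance normalization in its generic form: each channel of a hidden feature map is normalized over time, then rescaled and shifted by style-derived statistics. The function name `adain`, the `(channels, frames)` layout, and the `scale`/`shift` arguments are assumptions made for illustration; how DEX-TTS derives these statistics from its style encoders is not described in this summary.

```python
import numpy as np

def adain(hidden, scale, shift, eps=1e-5):
    """Generic AdaIN sketch (not the authors' implementation).

    hidden: array of shape (channels, frames)
    scale, shift: style-derived vectors of shape (channels,) (hypothetical inputs)
    """
    # Per-channel statistics computed over the time axis.
    mean = hidden.mean(axis=1, keepdims=True)
    std = hidden.std(axis=1, keepdims=True)
    # Remove the content's own statistics, then impose the style's.
    normalized = (hidden - mean) / (std + eps)
    return scale[:, None] * normalized + shift[:, None]
```

Because the same scale and shift are applied at every frame, this form of conditioning suits style information that does not change over time.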
Abstract
Expressive Text-to-Speech (TTS) using reference speech has been studied extensively to synthesize natural speech, but existing methods remain limited in obtaining well-represented styles and in generalization ability. In this study, we present Diffusion-based EXpressive TTS (DEX-TTS), an acoustic model designed for reference-based speech synthesis with enhanced style representations. Built on a general diffusion TTS framework, DEX-TTS includes encoders and adapters to handle styles extracted from reference speech. Key innovations include the differentiation of styles into time-invariant and time-variant categories for effective style extraction, as well as the design of encoders and adapters with high generalization ability. In addition, we introduce overlapping patchify and convolution-frequency patch embedding strategies to improve DiT-based diffusion networks for TTS. DEX-TTS yields outstanding performance in both objective and subjective evaluations on English multi-speaker and emotional multi-speaker datasets, without relying on pre-training strategies. Lastly, comparison results for general TTS on a single-speaker dataset verify the effectiveness of our enhanced diffusion backbone. Demos are available here.
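To illustrate the general idea behind overlapping patchify, the sketch below extracts square patches from a 2D feature map using a stride smaller than the patch size, so that neighboring patches share boundary regions. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `overlapping_patchify` and the `patch` and `stride` values are hypothetical, and the abstract does not specify the configuration used in DEX-TTS.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def overlapping_patchify(x, patch=4, stride=2):
    """Split a 2D map (height, width) into flattened, overlapping patches.

    patch and stride are illustrative values; stride < patch yields overlap,
    while stride == patch reduces to ordinary non-overlapping patchify.
    """
    # All patch-sized windows: shape (H - patch + 1, W - patch + 1, patch, patch).
    windows = sliding_window_view(x, (patch, patch))
    # Keep every `stride`-th window along both axes.
    windows = windows[::stride, ::stride]
    # Flatten each patch into one token vector: (num_patches, patch * patch).
    return windows.reshape(-1, patch * patch)
```

For a 6x6 input with `patch=4` and `stride=2`, this produces four tokens of length 16, and horizontally adjacent patches share two columns. In a DiT-style model, each flattened patch would then be projected to an embedding before entering the transformer blocks.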
