DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech

Xin Qi, Ruibo Fu, Zhengqi Wen, Tao Wang, Chunyu Qiang, Jianhua Tao, Chenxing Li, Yi Lu, Shuchen Shi, Zhiyong Wang, Xiaopeng Wang, Yuankun Xie, Yukun Liu, Xuefei Liu, Guanjun Li

TL;DR

DPI-TTS introduces directional patch interaction to exploit the intrinsic acoustic properties of speech, addressing a limitation of DiT-based TTS models, which treat Mel spectrograms as generic images. The method performs frame-by-frame, low-to-high frequency inference and incorporates fine-grained speaker style temporal modeling via cross-attention, resulting in faster training and improved speech naturalness and speaker similarity. Empirical results on LJSpeech and VCTK show a nearly 2x training speedup without sacrificing accuracy, with improvements in WER, MOS-N, and COS over strong baselines. This work demonstrates the viability of transformer-based diffusion approaches for TTS and provides a pathway toward more efficient and expressive diffusion-based speech synthesis.

Abstract

In recent years, speech diffusion models have advanced rapidly. Alongside the widely used U-Net architecture, transformer-based models such as the Diffusion Transformer (DiT) have also gained attention. However, current DiT speech models treat Mel spectrograms as general images, which overlooks the specific acoustic properties of speech. To address these limitations, we propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS), which builds on DiT and achieves fast training without compromising accuracy. Notably, DPI-TTS employs a low-to-high frequency, frame-by-frame progressive inference approach that aligns more closely with acoustic properties, enhancing the naturalness of the generated speech. Additionally, we introduce a fine-grained style temporal modeling method that further improves speaker style similarity. Experimental results demonstrate that our method increases the training speed by nearly 2 times and significantly outperforms the baseline models.

Paper Structure (17 sections, 2 figures, 3 tables, 1 algorithm)

This paper contains 17 sections, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: An overview of our method. In the Diffusion Decoder, $k$ global DiT blocks are used to supplement global details such as fundamental frequency, followed by $(N-k)$ directional DiT blocks for precise modeling. Each directional DiT block works in two steps. The first step adds time positional embeddings and integrates speaker-style information along the time axis. The second step incorporates both time and frequency domain positional embeddings, and each patch computes attention only with three surrounding patches, taken from the previous frame and the lower-frequency components.
  • Figure 2: Comparison of training speed between our method and the baseline model. Panels (a) and (b) show that, within the same number of training epochs, our method achieves nearly the same performance as the baseline model. Panel (c) shows that, over the same number of epochs, our training is nearly twice as fast.
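
The restricted attention pattern described in the Figure 1 caption can be illustrated with a small mask builder. This is a minimal sketch and not the authors' implementation. It assumes that patches are flattened frame by frame from low to high frequency, that the "three surrounding patches" are the same band in the previous frame, the next lower band in the current frame, and their diagonal, and that each patch may also attend to itself. The names `directional_neighbors`, `build_directional_mask`, and `include_self` are hypothetical.

```python
def directional_neighbors(t, f):
    """Assumed neighbor set for patch (frame t, band f):
    previous frame, lower band, and the diagonal between them."""
    return [(t - 1, f), (t, f - 1), (t - 1, f - 1)]


def build_directional_mask(num_frames, num_bands, include_self=True):
    """Return mask[q][k], True where query patch q may attend to key patch k.

    Patches are flattened frame by frame, low to high frequency:
    index = t * num_bands + f (an assumption made for this sketch).
    """
    n = num_frames * num_bands
    mask = [[False] * n for _ in range(n)]
    for t in range(num_frames):
        for f in range(num_bands):
            q = t * num_bands + f
            if include_self:
                # Assumption: self-attention is allowed so no row is empty.
                mask[q][q] = True
            for nt, nf in directional_neighbors(t, f):
                # Skip neighbors that fall outside the patch grid.
                if nt >= 0 and nf >= 0:
                    mask[q][nt * num_bands + nf] = True
    return mask


if __name__ == "__main__":
    # Print the mask for a tiny 3-frame, 2-band patch grid.
    for row in build_directional_mask(num_frames=3, num_bands=2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions every allowed key index is less than or equal to the query index, so the mask is a sparse subset of a causal mask over the flattened patch sequence, consistent with the frame-by-frame, low-to-high frequency ordering. In an attention layer, disallowed positions would typically be set to negative infinity before the softmax.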