EE-TTS: Emphatic Expressive TTS with Linguistic Information

Yi Zhong, Chen Zhang, Xule Liu, Chenxi Sun, Weishan Deng, Haifeng Hu, Zhongqian Sun

TL;DR

EE-TTS tackles the expressiveness gap in TTS by using multi-level linguistic information (syntax and semantics) both to predict emphasis positions from text and to condition the acoustic model. It combines a linguistic information extractor, an emphasis predictor, and a conditioned acoustic model built on a FastSpeech2 backbone with a Conformer encoder. Pre-training uses emphasis labels obtained without manual annotation from the Wavelet Prosody Toolkit. Compared with a baseline, the approach improves MOS by $0.49$ in expressiveness and $0.67$ in naturalness, and AB tests indicate that it generalizes across datasets. The work offers a practical route to more expressive TTS that relies less on manually annotated emphasis data while exploiting rich linguistic cues.

Abstract

While current TTS systems perform well in synthesizing high-quality speech, producing highly expressive speech remains a challenge. Emphasis, a critical factor in determining the expressiveness of speech, has attracted growing attention. Previous works usually enhance emphasis by adding intermediate features, but they cannot guarantee the overall expressiveness of the speech. To address this, we propose Emphatic Expressive TTS (EE-TTS), which leverages multi-level linguistic information from syntax and semantics. EE-TTS contains an emphasis predictor that identifies appropriate emphasis positions from text and a conditioned acoustic model that synthesizes expressive speech with emphasis and linguistic information. Experimental results indicate that EE-TTS outperforms the baseline, with MOS improvements of 0.49 in expressiveness and 0.67 in naturalness. EE-TTS also shows strong generalization across different datasets according to AB test results.
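
The data flow described in the abstract (linguistic information, then emphasis positions, then a conditioned acoustic input) can be illustrated with a toy sketch. This is not the authors' implementation: the `Token` fields, the `CONTENT_TAGS` set, the `salience` score, and the threshold rule are all hypothetical stand-ins for the learned linguistic extractor and emphasis predictor, and no acoustic model is run. It only shows how per-token emphasis flags might be attached to the features passed on to synthesis.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

# Hypothetical coarse tag set used by the toy rule below (not from the paper).
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


@dataclass
class Token:
    text: str
    pos_tag: str     # stand-in for syntactic information
    salience: float  # stand-in for semantic information, assumed in [0, 1]


def extract_linguistic_info(
    words: Sequence[str], pos_tags: Sequence[str], saliences: Sequence[float]
) -> List[Token]:
    """Bundle per-word features; a real extractor would compute them from text."""
    if not (len(words) == len(pos_tags) == len(saliences)):
        raise ValueError("words, pos_tags and saliences must have equal length")
    return [Token(w, p, s) for w, p, s in zip(words, pos_tags, saliences)]


def predict_emphasis(tokens: Sequence[Token], threshold: float = 0.5) -> List[int]:
    """Toy rule standing in for a learned predictor: 1 marks an emphasized token."""
    return [
        int(t.pos_tag in CONTENT_TAGS and t.salience >= threshold) for t in tokens
    ]


def condition_acoustic_input(
    tokens: Sequence[Token], emphasis: Sequence[int]
) -> List[Dict[str, object]]:
    """Attach the emphasis flag to each token's features for the acoustic model."""
    if len(tokens) != len(emphasis):
        raise ValueError("tokens and emphasis must have equal length")
    return [
        {"text": t.text, "pos_tag": t.pos_tag, "salience": t.salience, "emphasis": e}
        for t, e in zip(tokens, emphasis)
    ]


def run_pipeline(
    words: Sequence[str],
    pos_tags: Sequence[str],
    saliences: Sequence[float],
    threshold: float = 0.5,
) -> List[Dict[str, object]]:
    """Chain the three stages: extract, predict emphasis, build conditioning."""
    tokens = extract_linguistic_info(words, pos_tags, saliences)
    emphasis = predict_emphasis(tokens, threshold)
    return condition_acoustic_input(tokens, emphasis)


if __name__ == "__main__":
    frames = run_pipeline(
        ["I", "really", "love", "it"],
        ["PRON", "ADV", "VERB", "PRON"],
        [0.1, 0.9, 0.6, 0.2],
    )
    for frame in frames:
        print(frame["text"], frame["emphasis"])
```

In the system the abstract describes, both the features and the emphasis decision would come from trained models rather than a fixed rule; the sketch keeps only the ordering of the stages.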

Paper Structure

This paper contains 19 sections, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: The entire framework of EE-TTS. The dashed lines in subfigure (a) indicate the pre-training procedure. Subfigures (b) and (c) show the detailed structure of the linguistic encoder and emphasis predictor respectively, as well as the linguistic information extractor.
  • Figure 2: AB preference results between EE-TTS and the baseline on two datasets.