Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis

Tianrui Wang, Haoyu Wang, Meng Ge, Cheng Gong, Chunyu Qiang, Ziyang Ma, Zikang Huang, Guanrou Yang, Xiaobao Wang, Eng Siong Chng, Xie Chen, Longbiao Wang, Jianwu Dang

TL;DR

This work tackles word-level control of emotion and speaking rate in zero-shot TTS, a problem made difficult by the complexity of multi-emotion transitions and the scarcity of data annotated for intra-sentence variation. It introduces WeSCon, a two-stage self-training framework: a first-stage teacher extends a pretrained zero-shot TTS model with multi-round inference, transition smoothing, and dynamic speed control to generate word-level expressive speech, and a second-stage student learns the same control end to end with a dynamic emotional attention bias (DEAB). Across English and Chinese, WeSCon achieves state-of-the-art performance in word-level emotion and speed control while preserving zero-shot capabilities, and ablations validate the contributions of transition smoothing, speed control, and DEAB. The approach reduces reliance on large, finely annotated datasets and makes expressive TTS more practical. The authors note limitations in modeling gradual emotion evolution and in diversity, and they discuss broader societal impacts and potential misuse along with suggested safeguards.

Abstract

While emotional text-to-speech (TTS) has made significant progress, most existing research remains limited to utterance-level emotional expression and fails to support word-level control. Achieving word-level expressive control poses fundamental challenges, primarily due to the complexity of modeling multi-emotion transitions and the scarcity of annotated datasets that capture intra-sentence emotional and prosodic variation. In this paper, we propose WeSCon, the first self-training framework that enables word-level control of both emotion and speaking rate in a pretrained zero-shot TTS model, without relying on datasets containing intra-sentence emotion or speed transitions. Our method introduces a transition-smoothing strategy and a dynamic speed control mechanism to guide the pretrained TTS model in performing word-level expressive synthesis through a multi-round inference process. To further simplify the inference, we incorporate a dynamic emotional attention bias mechanism and fine-tune the model via self-training, thereby activating its ability for word-level expressive control in an end-to-end manner. Experimental results show that WeSCon effectively overcomes data scarcity, achieving state-of-the-art performance in word-level emotional expression control while preserving the strong zero-shot synthesis capabilities of the original TTS model.

Paper Structure

This paper contains 53 sections, 5 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Word-level control of emotion and speaking rate aims to modulate both attributes within an utterance, guided by multiple emotional prompts and emotion-speed-tagged text. Our approach, WeSCon, achieves this using only a small-scale public dataset without emotion transitions.
  • Figure 2: Overview of WeSCon. The 1st-stage teacher extends a zero-shot TTS model with dynamic speed control, transition smoothing, and multi-round inference to enable word-level emotion and speaking rate control. In the 2nd stage, the teacher supervises a student model equipped with a dynamic emotional attention bias (DEAB), which achieves the same control in an end-to-end manner with reduced inference complexity.
  • Figure 3: Word-level emotion and speaking rate control using a transition-smoothing module and dynamic speed adjustment. At each inference round, an emotional prompt is used to generate a speech segment, with the tail of the previous output appended to ensure continuity. Speaking rate is controlled by interpolating or downsampling prompt speech tokens. The final utterance is produced by concatenating all segments and decoding them through flow matching and a vocoder.
  • Figure 4: The proposed self-training strategy. A teacher model that relies on complex multi-round inference supervises a student TTS model to enable word-level emotion and speaking rate control. The dynamic emotional attention bias mechanism further enhances expressive generation in simplified, end-to-end, single-pass inference.
  • Figure 5: Performance trends on the Chinese test set under different self-training data sizes.
  • ...and 6 more figures
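
The multi-round procedure described in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The caption says only that speaking rate is controlled by "interpolating or downsampling prompt speech tokens" and that the tail of the previous output is carried into the next round. The nearest-neighbour resampling rule, the `speed` convention (values above 1 shorten the prompt), the `tail_len` parameter, and the `generate` callable standing in for the pretrained TTS model are all assumptions made for illustration. The final decoding through flow matching and a vocoder is omitted.

```python
from typing import Callable, List, Sequence, Tuple

# (text of the segment, speech tokens of its emotional prompt, speed factor)
Segment = Tuple[str, Sequence[int], float]
# Hypothetical stand-in for the pretrained TTS model:
# (segment text, resampled prompt tokens, tail of previous output) -> new tokens
Generator = Callable[[str, List[int], List[int]], List[int]]


def resample_tokens(tokens: Sequence[int], speed: float) -> List[int]:
    """Nearest-neighbour resampling of a discrete token sequence.

    Assumed convention: speed > 1 downsamples (shorter prompt),
    speed < 1 interpolates by repeating tokens (longer prompt).
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not tokens:
        return []
    new_len = max(1, round(len(tokens) / speed))
    last = len(tokens) - 1
    # Map each output position back to the closest earlier source position.
    return [tokens[min(last, int(i * speed))] for i in range(new_len)]


def multi_round_synthesis(
    segments: Sequence[Segment],
    generate: Generator,
    tail_len: int = 2,
) -> List[int]:
    """Generate one token segment per round and concatenate the results."""
    output: List[int] = []
    for text, prompt_tokens, speed in segments:
        prompt = resample_tokens(prompt_tokens, speed)
        # Pass the tail of what has been generated so far as context,
        # so the next segment can continue smoothly from it.
        tail = output[-tail_len:] if tail_len > 0 else []
        output.extend(generate(text, prompt, tail))
    return output
```

In this sketch each round sees only its own emotional prompt plus a short tail of the previous output, which is why the teacher needs one inference pass per emotion segment. The second-stage student described in Figure 4 is trained to remove exactly this overhead.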