Task Vector in TTS: Toward Emotionally Expressive Dialectal Speech Synthesis
Pengchao Feng, Yao Xiao, Ziyang Ma, Zhikang Niu, Shuai Fan, Yao Li, Sheng Wang, Xie Chen
TL;DR
This paper tackles the challenge of generating emotionally expressive dialectal speech without requiring datasets labeled for both dialect and emotion. It introduces E-Vector, which models a single expressive style as a task vector (a parameter perturbation applied to a base TTS model), and extends it with LoRA variants for efficiency. The core contribution is HE-Vector, a hierarchical merging strategy that assigns the dialect and emotion E-Vectors to separate model layers, enabling joint control with reduced interference and no jointly labeled data. Empirical results show strong dialect synthesis performance and promising zero-shot emotionally expressive dialect synthesis, indicating that multi-style TTS can be built from single-style data alone.
Abstract
Recent advances in text-to-speech (TTS) have yielded remarkable improvements in naturalness and intelligibility. Building on these achievements, research has increasingly shifted toward enhancing the expressiveness of generated speech, as in dialectal and emotional TTS. However, cross-style synthesis combining both dialect and emotion remains challenging and largely unexplored, mainly due to the scarcity of dialectal data with emotional labels. To address this, we propose Hierarchical Expressive Vector (HE-Vector), a two-stage method for Emotional Dialectal TTS. In the first stage, we construct separate task vectors to model dialectal and emotional styles independently, and then enhance single-style synthesis by adjusting their weights, a method we refer to as Expressive Vector (E-Vector). In the second stage, we hierarchically integrate these vectors to achieve controllable emotionally expressive dialect synthesis without requiring jointly labeled data; this integrated form is the HE-Vector. Experimental results demonstrate that HE-Vectors achieve superior performance in dialect synthesis, and promising results in synthesizing emotionally expressive dialectal speech in a zero-shot setting.
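The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the common task-vector formulation (fine-tuned weights minus base weights, scaled and added back), uses toy per-layer weight lists in place of a real TTS model, and the layer names, the choice of which layers receive the dialect vector, and the scaling weights `alpha_d` and `alpha_e` are all hypothetical.

```python
from typing import Dict, List, Set

# Toy stand-in for a model: layer name -> flattened weights (hypothetical).
Params = Dict[str, List[float]]


def task_vector(base: Params, finetuned: Params) -> Params:
    """Task vector: element-wise difference (fine-tuned minus base) per layer."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def apply_e_vector(base: Params, vec: Params, alpha: float) -> Params:
    """Stage 1 (single style): perturb the base by a weighted task vector."""
    return {k: [b + alpha * v for b, v in zip(base[k], vec[k])] for k in base}


def hierarchical_merge(
    base: Params,
    dialect_vec: Params,
    emotion_vec: Params,
    dialect_layers: Set[str],
    alpha_d: float,
    alpha_e: float,
) -> Params:
    """Stage 2 (two styles): each layer takes exactly one style's vector.

    Layers named in `dialect_layers` receive the dialect vector; all other
    layers receive the emotion vector. This partition is an assumption made
    for illustration only.
    """
    merged: Params = {}
    for k in base:
        if k in dialect_layers:
            vec, alpha = dialect_vec[k], alpha_d
        else:
            vec, alpha = emotion_vec[k], alpha_e
        merged[k] = [b + alpha * v for b, v in zip(base[k], vec)]
    return merged


# Toy weights for a base model and two single-style fine-tuned models.
base = {"layer0": [1.0, 2.0], "layer1": [0.5, -0.5]}
dialect_ft = {"layer0": [1.5, 2.0], "layer1": [0.5, 0.0]}
emotion_ft = {"layer0": [1.0, 1.0], "layer1": [1.5, -0.5]}

dialect_vec = task_vector(base, dialect_ft)
emotion_vec = task_vector(base, emotion_ft)

# With alpha = 1.0 the perturbed base reproduces the fine-tuned model.
single_style = apply_e_vector(base, dialect_vec, alpha=1.0)

# Joint control: dialect vector on layer0, emotion vector on layer1.
merged = hierarchical_merge(
    base, dialect_vec, emotion_vec,
    dialect_layers={"layer0"}, alpha_d=2.0, alpha_e=0.5,
)
```

Because each layer is perturbed by only one style's vector, the two styles never write to the same parameters, which is one plausible reading of how a layer-wise merge reduces interference. Neither fine-tuned model needs data labeled for both dialect and emotion.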
