EELE: Exploring Efficient and Extensible LoRA Integration in Emotional Text-to-Speech
Xin Qi, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Shuchen Shi, Yi Lu, Zhiyong Wang, Xiaopeng Wang, Yuankun Xie, Yukun Liu, Guanjun Li, Xuefei Liu, Yongwei Li
TL;DR
This work tackles the scalability and flexibility challenge of endowing TTS with emotions by introducing EELE, a plug-and-play approach that starts from a neutral TTS model and adds emotion capability post-training via LoRA adapters. By experimenting with eight module-level LoRA deployment schemes in the VITS2 backbone, the authors systematically evaluate where emotion should be modeled and how duration cues influence expression. Key findings show that certain LoRA placements (notably scheme 'g') yield superior emotion recognition, with anger most prominent and surprise the hardest to detect, while smaller LoRA ranks can achieve competitive performance, offering significant efficiency over full fine-tuning. The approach demonstrates practical advantages in adaptability and scalability for emotion-rich TTS across applications like media and gaming.
Abstract
In the current era of Artificial Intelligence Generated Content (AIGC), Low-Rank Adaptation (LoRA) has emerged as a plugin-based method that learns new knowledge with fewer parameters and lower computational cost. Because a LoRA module can be plugged in or out for a specific sub-task, it offers high flexibility. However, current application schemes primarily incorporate LoRA into the conditional parts that are pre-introduced into speech models. This fixes the position of LoRA and limits the flexibility and scalability of its application. We therefore propose EELE (Exploring Efficient and Extensible LoRA Integration in Emotional Text-to-Speech). Starting from a general neutral speech model, we do not pre-introduce emotional information; instead, we use LoRA plugins to design a flexible adaptive scheme that endows the model with emotional generation capabilities. Specifically, we first train the model using only neutral speech data. After training is complete, we insert LoRA into different modules and fine-tune the model with emotional speech data to find the optimal insertion scheme. Through experiments, we compare the effects of inserting LoRA at different positions within the model and assess LoRA's ability to learn various emotions, which supports the validity of our method. Additionally, we explore the impact of LoRA rank size and the difference compared with directly fine-tuning the entire model.
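To make the plug-in idea concrete, the following is a minimal sketch of a single linear layer with a detachable LoRA update, written in plain Python. It is not the authors' implementation: the class name `LoRALinear`, the `enabled` switch, the `alpha` scaling factor, and the toy weight matrix are all hypothetical, and the real model applies the same idea to neural modules of VITS2 rather than to small lists of floats. The sketch assumes the standard LoRA formulation, in which the pretrained weight W stays frozen and only the low-rank factors A and B are trained, giving y = Wx + (alpha / r) * B(Ax).

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a detachable low-rank update (alpha / r) * B @ A.

    Hypothetical illustration only; names and defaults are assumptions.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        self.d_out, self.d_in = len(weight), len(weight[0])
        self.weight = weight      # stands in for a layer trained on neutral speech
        self.rank = rank
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the adapter
        # leaves the neutral output unchanged before any fine-tuning.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(self.d_in)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(self.d_out)]
        self.enabled = True       # plug the adapter in or out

    def forward(self, x):
        y = matvec(self.weight, x)
        if self.enabled:
            delta = matvec(self.B, matvec(self.A, x))
            y = [yi + self.scale * di for yi, di in zip(y, delta)]
        return y

    def adapter_params(self):
        """Number of values trained when only A and B are updated."""
        return self.rank * (self.d_in + self.d_out)

    def full_params(self):
        """Number of values trained when W itself is fine-tuned."""
        return self.d_in * self.d_out


# Toy 3x4 weight matrix standing in for one module of the neutral model.
W = [[1.0, 2.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 3.0],
     [0.5, 0.0, 1.0, 0.0]]
layer = LoRALinear(W, rank=1)
x = [1.0, 1.0, 1.0, 1.0]

neutral = layer.forward(x)            # adapter is a no-op while B is zero
layer.B = [[1.0], [0.0], [0.0]]       # pretend fine-tuning changed B
emotional = layer.forward(x)          # first output now differs from neutral
layer.enabled = False                 # unplug the adapter
restored = layer.forward(x)           # neutral behaviour is recovered
```

In this toy case the adapter trains 7 values against 12 for the full matrix. The gap grows quickly with layer size, since the adapter cost scales as r(d_in + d_out) while full fine-tuning scales as d_in * d_out, which is why small ranks are attractive. Choosing which modules receive such an adapter corresponds to the insertion schemes compared in the paper.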
