Fed-PISA: Federated Voice Cloning via Personalized Identity-Style Adaptation
Qi Wang, Shituo Ma, Guoxin Yu, Hanyang Peng, Yue Yu
TL;DR
Fed-PISA addresses privacy-preserving, personalized voice cloning in federated learning by disentangling speaker timbre from style via a private Identity-LoRA and a global Style-LoRA. The server aggregates style updates with a personalized attention mechanism over stylistically similar peers, enabling richer cross-client style transfer while keeping data on-device. The approach reduces communication costs compared to prior federated TTS methods and yields improved style expressivity, speaker similarity, and perceived naturalness. This work advances practical, privacy-preserving, personalized TTS by combining PEFT with collaborative style learning.
Abstract
Voice cloning for Text-to-Speech (TTS) aims to generate expressive and personalized speech from text using limited data from a target speaker. Federated Learning (FL) offers a collaborative and privacy-preserving framework for this task, but existing approaches suffer from high communication costs and tend to suppress stylistic heterogeneity, resulting in insufficient personalization. To address these issues, we propose Fed-PISA, which stands for Federated Personalized Identity-Style Adaptation. To minimize communication costs, Fed-PISA introduces a disentangled Low-Rank Adaptation (LoRA) mechanism: the speaker's timbre is retained locally through a private ID-LoRA, while only a lightweight style-LoRA is transmitted to the server. To harness heterogeneity, we introduce an aggregation method, inspired by collaborative filtering, that builds a custom model for each client by learning from stylistically similar peers. Experiments show that Fed-PISA improves style expressivity, naturalness, and speaker similarity, outperforming standard federated baselines with minimal communication costs.
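The aggregation idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how peer similarity is measured or how weights are computed, so the cosine similarity, the softmax with a `temperature` parameter, the flat-vector representation of the style-LoRA, and all function and field names below are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flat parameter vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0

def softmax(scores, temperature=1.0):
    """Numerically stable softmax; lower temperature sharpens the weights."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def personalized_aggregate(style_updates, temperature=0.1):
    """Return one aggregated style vector per client.

    Each client's own upload acts as the query; peers with similar
    style parameters receive larger weights (assumed scoring rule).
    """
    personalized = []
    for query in style_updates:
        scores = [cosine_similarity(query, peer) for peer in style_updates]
        weights = softmax(scores, temperature)
        aggregated = [
            sum(w * peer[d] for w, peer in zip(weights, style_updates))
            for d in range(len(query))
        ]
        personalized.append(aggregated)
    return personalized

# Toy clients (hypothetical values). "id_lora" stays on the device;
# only "style_lora" is uploaded to the server.
clients = [
    {"id_lora": [0.9, 0.1], "style_lora": [1.0, 0.0, 0.0]},
    {"id_lora": [0.2, 0.7], "style_lora": [0.9, 0.1, 0.0]},
    {"id_lora": [0.5, 0.5], "style_lora": [0.0, 0.0, 1.0]},
]
uploads = [c["style_lora"] for c in clients]
personalized = personalized_aggregate(uploads)
```

In this toy setup, clients 0 and 1 have similar style vectors and therefore draw mostly on each other, while client 2 receives a result close to its own upload. A plain average would instead pull all three toward the same point, which is the kind of heterogeneity suppression the abstract attributes to existing approaches.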
