Fed-PISA: Federated Voice Cloning via Personalized Identity-Style Adaptation

Qi Wang, Shituo Ma, Guoxin Yu, Hanyang Peng, Yue Yu

TL;DR

Fed-PISA addresses privacy-preserving, personalized voice cloning in federated learning by disentangling speaker timbre from style: a private Identity-LoRA stays on the client, and only a shared Style-LoRA is exchanged with the server. The server aggregates style updates with a personalized attention mechanism over stylistically similar peers, enabling richer cross-client style transfer while keeping data on-device. The approach reduces communication costs compared to prior federated TTS methods and improves style expressivity, speaker similarity, and perceived naturalness. The work advances practical, privacy-preserving, personalized TTS by combining parameter-efficient fine-tuning (PEFT) with collaborative style learning.
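The split between a private identity adapter and an exchanged style adapter can be illustrated with a small sketch. This is not the authors' implementation: the additive composition of the two low-rank updates on a single frozen weight matrix, and the names `LoRA`, `Client`, `effective_weight`, and `upload_payload`, are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


@dataclass
class LoRA:
    B: Matrix  # shape d_out x r
    A: Matrix  # shape r x d_in

    def delta(self) -> Matrix:
        """Low-rank weight update B @ A."""
        return matmul(self.B, self.A)

    def num_params(self) -> int:
        return len(self.B) * len(self.B[0]) + len(self.A) * len(self.A[0])


@dataclass
class Client:
    base: Matrix      # frozen pretrained weight, never transmitted
    id_lora: LoRA     # private timbre adapter, stays on the device
    style_lora: LoRA  # style adapter, the only part sent to the server

    def effective_weight(self) -> Matrix:
        # Assumption: both adapters are added to the same frozen weight.
        return add(add(self.base, self.id_lora.delta()), self.style_lora.delta())

    def upload_payload(self) -> Dict[str, Matrix]:
        # Only style parameters leave the client.
        return {"style_B": self.style_lora.B, "style_A": self.style_lora.A}


client = Client(
    base=[[1.0, 0.0], [0.0, 1.0]],
    id_lora=LoRA(B=[[1.0], [0.0]], A=[[0.0, 1.0]]),
    style_lora=LoRA(B=[[0.0], [1.0]], A=[[2.0, 0.0]]),
)
```

In this toy setup the upload contains `style_lora.num_params()` values, while the frozen weight and the identity adapter never leave the client. That is the intuition behind the claimed communication savings.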

Abstract

Voice cloning for Text-to-Speech (TTS) aims to generate expressive and personalized speech from text using limited data from a target speaker. Federated Learning (FL) offers a collaborative and privacy-preserving framework for this task, but existing approaches suffer from high communication costs and tend to suppress stylistic heterogeneity, resulting in insufficient personalization. To address these issues, we propose Fed-PISA, which stands for Federated Personalized Identity-Style Adaptation. To minimize communication costs, Fed-PISA introduces a disentangled Low-Rank Adaptation (LoRA) mechanism: the speaker's timbre is retained locally through a private ID-LoRA, while only a lightweight Style-LoRA is transmitted to the server. To harness heterogeneity, we introduce an aggregation method, inspired by collaborative filtering, that creates a custom model for each client by learning from stylistically similar peers. Experiments show that Fed-PISA improves style expressivity, naturalness, and speaker similarity, outperforming standard federated baselines with minimal communication costs.
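The idea of building a custom style model per client from stylistically similar peers can be sketched as a similarity-weighted average. The abstract does not give the exact similarity measure or weighting, so the cosine similarity over flattened style updates, the softmax with a `temperature` parameter, and the function name `personalized_aggregate` below are all assumptions, not the paper's formulation.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def softmax(scores: List[float], temperature: float) -> List[float]:
    """Numerically stable softmax over similarity scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def personalized_aggregate(
    style_updates: Dict[str, Vector], temperature: float = 0.1
) -> Dict[str, Vector]:
    """Return one aggregated style vector per client.

    Each client's result is a weighted average of all uploaded style
    vectors, with weights concentrated on peers whose updates point in
    a similar direction (an assumed stand-in for "stylistically similar").
    """
    ids = list(style_updates)
    result: Dict[str, Vector] = {}
    for i in ids:
        scores = [cosine(style_updates[i], style_updates[j]) for j in ids]
        weights = softmax(scores, temperature)
        dim = len(style_updates[i])
        result[i] = [
            sum(w * style_updates[j][d] for w, j in zip(weights, ids))
            for d in range(dim)
        ]
    return result


# Toy flattened style updates: "a" and "b" are similar, "c" is opposite.
updates = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [-1.0, 0.0]}
personalized = personalized_aggregate(updates)
```

Unlike a uniform average, which would pull all three toy clients toward the same point, this weighting keeps the model for "a" close to "a" and "b", and the model for "c" close to "c". A lower `temperature` sharpens that effect.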

Paper Structure

This paper contains 16 sections, 2 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: An overview of the Fed-PISA framework. On the client side, a private ID-LoRA captures speaker timbre locally, while only a lightweight Style-LoRA is trained and uploaded for aggregation. The server then employs a personalized aggregation strategy to create a custom style model for each client by learning from stylistically similar peers.
  • Figure 2: Performance trade-off when allocating training steps between ID-LoRA ($n$) and Style-LoRA ($m$). The solid line denotes the mean performance over three independent runs with different random seeds, and the shaded region represents the standard deviation across runs.