Multilingual Prosody Transfer: Comparing Supervised & Transfer Learning

Arnav Goel; Medha Hira; Anubha Gupta

TL;DR

The paper tackles multilingual prosody transfer by comparing two learning strategies, supervised fine-tuning (SFT) and transfer learning (TL), for adapting English pre-trained TTS to German, French, Spanish, Dutch, Hindi, and Tamil. It contrasts an SFT pipeline built on SpeechT5 with x-vector speaker embeddings against a TL workflow that combines a pre-trained speaker encoder, FreeVC, and MMS TTS for prosody-preserving synthesis. Across the six languages, TL yields higher MOS, higher RA, and lower MCD (on average, MOS higher by 1.53 points, RA up by about 37.5%, and MCD lower by about 7.8 points), indicating stronger preservation of voice characteristics and prosody. These findings support TL as a more data-efficient path for building multilingual TTS, especially in low-resource settings, and point to future work on frameworks for comparing learning methods in multilingual, low-resource scenarios.

Abstract

The field of prosody transfer in speech synthesis systems is rapidly advancing. This research evaluates two learning methods for adapting pre-trained monolingual text-to-speech (TTS) models to multilingual conditions: Supervised Fine-Tuning (SFT) and Transfer Learning (TL). The comparison uses three metrics: Mean Opinion Score (MOS), Recognition Accuracy (RA), and Mel Cepstral Distortion (MCD). Results show that, compared to SFT, TL leads to significantly better performance, with an average MOS higher by 1.53 points, a 37.5% increase in RA, and an improvement of approximately 7.8 points in MCD. These findings can help build TTS models for low-resource languages.
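Of the three metrics, MCD is the only one computed directly from the audio, so a small sketch may help show what a "lower MCD" means. The code below uses the commonly cited definition, MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), averaged over frames. It is an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, the frames are assumed to be time-aligned already (for example by dynamic time warping), and the 0th (energy) coefficient is assumed to be excluded. The abstract does not say which MCD variant was used.

```python
import math

# Constant from the commonly used MCD definition: (10 / ln 10) * sqrt(2).
MCD_CONST = 10.0 / math.log(10.0) * math.sqrt(2.0)


def frame_mcd(ref, syn):
    """MCD in dB for one pair of mel-cepstral vectors.

    Assumption: both vectors already exclude the 0th (energy) coefficient.
    """
    if len(ref) != len(syn):
        raise ValueError("cepstral vectors must have the same dimension")
    squared_diff = sum((r - s) ** 2 for r, s in zip(ref, syn))
    return MCD_CONST * math.sqrt(squared_diff)


def mean_mcd(ref_frames, syn_frames):
    """Average frame-level MCD over two sequences of equal length.

    Assumption: the sequences are already time-aligned frame by frame.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("need two non-empty, equal-length frame sequences")
    total = sum(frame_mcd(r, s) for r, s in zip(ref_frames, syn_frames))
    return total / len(ref_frames)


if __name__ == "__main__":
    # Toy values for illustration only; these are not data from the paper.
    reference = [[1.0, 0.5], [0.8, 0.4]]
    synthesized = [[0.9, 0.5], [0.8, 0.6]]
    print(f"mean MCD: {mean_mcd(reference, synthesized):.3f} dB")
```

Identical cepstra give an MCD of zero, and the value grows with the Euclidean distance between reference and synthesized frames, which is why a lower MCD indicates synthesized speech that is spectrally closer to the reference.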

Paper Structure (11 sections, 6 figures, 2 tables)

Introduction
Related Work
Dataset
Methodology
Results
Conclusion and Future Work
Appendix
Acronyms Used
Speaker Embeddings
Fine-Tuning Implementation Details and Plots
MOS and Recognition Accuracy Calculation Protocol

Figures (6)

Figure 1: Emotion
Figure 2: Gender
Figure 4: Training Loss vs Epochs for French
Figure 5: Validation Loss vs Epochs for French
Figure 6: Comparing Waveforms of the three audio clips
...and 1 more figure
