Multilingual Prosody Transfer: Comparing Supervised & Transfer Learning

Arnav Goel, Medha Hira, Anubha Gupta

TL;DR

The paper tackles multilingual prosody transfer by comparing two learning strategies, Supervised Fine-Tuning (SFT) and Transfer Learning (TL), for adapting English pre-trained TTS to German, French, Spanish, Dutch, Hindi, and Tamil. It contrasts an SFT pipeline built on SpeechT5 with x-vectors against a TL workflow that combines a pre-trained speaker encoder, FreeVC, and MMS TTS for prosody-preserving synthesis. Across the six languages, TL yields higher MOS, higher RA, and lower MCD than SFT (on average, MOS higher by 1.53 points, RA up by about 37.5%, and MCD down by about 7.8 points), indicating stronger preservation of voice characteristics and prosody. These findings support TL as a more data-efficient path for building multilingual TTS, especially in low-resource settings, and point to future work on frameworks for comparing learning methods in multilingual, low-resource scenarios.

Abstract

The field of prosody transfer in speech synthesis systems is rapidly advancing. This research evaluates two learning methods, Supervised Fine-Tuning (SFT) and Transfer Learning (TL), for adapting pre-trained monolingual text-to-speech (TTS) models to multilingual conditions. The comparison uses three distinct metrics: Mean Opinion Score (MOS), Recognition Accuracy (RA), and Mel Cepstral Distortion (MCD). Results demonstrate that, in comparison to SFT, TL leads to significantly enhanced performance, with an average MOS higher by 1.53 points, a 37.5% increase in RA, and approximately a 7.8-point improvement in MCD. These findings are instrumental in helping build TTS models for low-resource languages.
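
Of the three metrics, MCD is the only one computed purely from signal features, so a small sketch may help in reading the "7.8-point improvement" (lower MCD is better). The summary does not state which MCD variant the authors used. The code below assumes the commonly used definition, $\frac{10}{\ln 10}\sqrt{2\sum_d (c_d - c'_d)^2}$ averaged over frames, and it assumes the two sequences of mel-cepstral coefficients are already time-aligned with the energy coefficient excluded. The function name and input layout are illustrative, not taken from the paper.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Average MCD (in dB) between two sequences of mel-cepstral vectors.

    Assumptions (not from the paper): frames are already time-aligned,
    both sequences have the same length, and the 0th (energy)
    coefficient has been removed from every frame.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("sequences must be non-empty and equally long")

    # Constant factor of the common MCD definition: (10 / ln 10) * sqrt(2)
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        if len(ref) != len(syn):
            raise ValueError("frames must have the same number of coefficients")
        squared_error = sum((r - s) ** 2 for r, s in zip(ref, syn))
        total += scale * math.sqrt(squared_error)

    # Mean over frames
    return total / len(ref_frames)


if __name__ == "__main__":
    reference = [[1.0, 0.5, -0.2], [0.9, 0.4, -0.1]]
    synthesized = [[1.1, 0.4, -0.2], [0.8, 0.5, -0.1]]
    print(f"MCD: {mel_cepstral_distortion(reference, synthesized):.3f} dB")
```

In practice the alignment step (for example dynamic time warping) and the feature extractor both affect the absolute value, so MCD figures are only comparable when computed under the same setup.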

Paper Structure

This paper contains 11 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Emotion
  • Figure 2: Gender
  • Figure 4: Training Loss vs Epochs for French
  • Figure 5: Validation Loss vs Epochs for French
  • Figure 6: Comparing Waveforms of the three audio clips
  • ...and 1 more figure