Scaling NVIDIA's Multi-speaker Multi-lingual TTS Systems with Zero-Shot TTS to Indic Languages
Akshit Arora, Rohan Badlani, Sungwon Kim, Rafael Valle, Bryan Catanzaro
TL;DR
The paper tackles multi-speaker, multilingual TTS with zero-shot capability for Indic languages. It presents two approaches: RAD-MMM for few-shot synthesis (Tracks 1 and 2) and P-Flow for zero-shot synthesis (Track 3), supported by external datasets and careful preprocessing. RAD-MMM performs competitively on Tracks 1 and 2, while P-Flow ranks first on Track 3 with MOS $=4.4$ and SMOS $=3.62$. The study demonstrates that disentanglement-based, language-agnostic conditioning and speech-prompt-driven zero-shot adaptation, each paired with a HiFi-GAN vocoder, can deliver high-quality cross-lingual TTS.
Abstract
In this paper, we describe the TTS models developed by NVIDIA for the MMITS-VC (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024 Challenge. In Tracks 1 and 2, we utilize RAD-MMM to perform few-shot TTS by training additionally on 5 minutes of target speaker data. In Track 3, we utilize P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets. We use HiFi-GAN vocoders for all submissions. RAD-MMM performs competitively on Tracks 1 and 2, while P-Flow ranks first on Track 3, with a mean opinion score (MOS) of 4.4 and a speaker similarity score (SMOS) of 3.62.
