A Multi-Speaker Multi-Lingual Voice Cloning System Based on VITS2 for the LIMMITS 2024 Challenge

Xiaopeng Wang, Yi Lu, Xin Qi, Zhiyong Wang, Yuankun Xie, Shuchen Shi, Ruibo Fu

TL;DR

This work tackles multi-speaker, multi-lingual Indic TTS with voice cloning across seven languages. It extends the VITS2 framework with multilingual identifiers and IndicBERT-enhanced context in the text encoder, coupled with IPA-based phoneme input and language/speaker conditioning. The model is pre-trained on seven languages and fine-tuned in a few-shot regime to clone target voices, achieving MOS of 3.04–3.12 for naturalness and speaker similarity scores up to 4.17, with strong rankings in Track 2. The approach demonstrates robust cross-lingual voice cloning and practical viability for diverse Indic languages under varying data allowances.

Abstract

This paper presents the development of a speech synthesis system for the LIMMITS'24 Challenge, focusing primarily on Track 2. The objective of the challenge is to establish a multi-speaker, multi-lingual Indic Text-to-Speech system with voice cloning capabilities, covering seven Indian languages with both male and female speakers. The system was trained using challenge data and fine-tuned for few-shot voice cloning on target speakers. Evaluation included both mono-lingual and cross-lingual synthesis across all seven languages, with subjective tests assessing naturalness and speaker similarity. Our system uses the VITS2 architecture, augmented with a multi-lingual ID and a BERT model to enhance contextual language comprehension. In Track 1, where no additional data usage was permitted, our model achieved a Speaker Similarity score of 4.02. In Track 2, which allowed the use of extra data, it attained a Speaker Similarity score of 4.17.
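The described text encoder combines IPA phoneme embeddings with BERT-derived contextual features and conditions on language and speaker identifiers. The following PyTorch sketch illustrates one plausible way to wire these inputs together; all module names, dimensions, and the additive conditioning scheme are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MultilingualTextEncoder(nn.Module):
    """Hypothetical sketch of a VITS2-style text encoder with language/speaker
    ID embeddings and projected BERT context added to phoneme embeddings.
    Dimensions and layer choices are illustrative assumptions."""

    def __init__(self, n_phonemes, n_languages, n_speakers,
                 d_model=192, d_bert=768):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)   # IPA phoneme table
        self.lang_emb = nn.Embedding(n_languages, d_model)     # one ID per language
        self.spk_emb = nn.Embedding(n_speakers, d_model)       # one ID per speaker
        self.bert_proj = nn.Linear(d_bert, d_model)            # project BERT features
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=2, batch_first=True),
            num_layers=2)

    def forward(self, phoneme_ids, lang_id, spk_id, bert_feats):
        # phoneme_ids: (B, T) IPA phoneme indices
        # bert_feats:  (B, T, d_bert) phoneme-aligned BERT hidden states
        x = self.phoneme_emb(phoneme_ids) + self.bert_proj(bert_feats)
        cond = self.lang_emb(lang_id) + self.spk_emb(spk_id)   # (B, d_model)
        x = x + cond.unsqueeze(1)  # broadcast conditioning over the time axis
        return self.encoder(x)     # (B, T, d_model) hidden sequence

# Usage with the challenge's 7 languages (speaker count is illustrative):
enc = MultilingualTextEncoder(n_phonemes=100, n_languages=7, n_speakers=14)
out = enc(torch.randint(0, 100, (2, 10)),      # phoneme IDs
          torch.tensor([0, 1]),                # language IDs
          torch.tensor([3, 5]),                # speaker IDs
          torch.randn(2, 10, 768))             # BERT features
```

Adding the language and speaker embeddings to every time step is one common conditioning choice; concatenation or FiLM-style modulation would be equally plausible readings of the description.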

Paper Structure

This paper contains 9 sections, 1 figure, and 1 table.

Figures (1)

  • Figure 1: Multilingual multi-speaker text encoder with BERT