Advancing Voice Cloning for Nepali: Leveraging Transfer Learning in a Low-Resource Language

Manjil Karki; Pratik Shakya; Sandesh Acharya; Ravi Pandit; Dinesh Gothe

TL;DR

This work addresses Nepali voice cloning in low-resource settings by deploying a transfer-learning–driven three-model pipeline (encoder, synthesizer, vocoder) to generate Nepali-accented speech from text in Devanagari. It builds a Nepali speech corpus of 546 speakers totaling ~168 hours, and uses a Tacotron2-based synthesizer paired with a WaveNet vocoder, guided by data preprocessing and GTA alignment workflows. Transfer learning from a multilingual model is key, enabling better speaker representation (GE2E loss $0.02 \pm 0.01$, EER $0.005 \pm 0.001$) and improved perceptual quality (MOS naturalness $3.93$, similarity $3.29$; PESQ $2.8$ val., $2.3$ test). Overall, the system demonstrates feasible Nepali voice cloning under data constraints, with strong embedding alignment and perceptual scores, while highlighting dataset quality as a limiting factor and pointing to future refinements in data and modeling.
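The equal error rate (EER) quoted above is the operating point at which a speaker-verification system falsely accepts impostors as often as it falsely rejects genuine speakers. The sketch below shows one common way to estimate it from similarity scores. The function name `equal_error_rate`, the threshold sweep, and the toy scores are illustrative assumptions, not the authors' evaluation code or data.

```python
def equal_error_rate(genuine, impostor):
    """Estimate (eer, threshold) from similarity scores.

    Assumes higher scores mean "more likely the same speaker".
    `genuine` holds scores for same-speaker pairs, `impostor` for
    different-speaker pairs. Both names are hypothetical.
    """
    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(genuine) | set(impostor))
    best = None
    for t in thresholds:
        # False rejection: a genuine pair scored below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        # False acceptance: an impostor pair scored at or above it.
        far = sum(s >= t for s in impostor) / len(impostor)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores, invented purely for illustration.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
print(eer, threshold)  # 0.25 0.6
```

With a finite set of scores the two error curves rarely cross exactly, so the sketch reports the midpoint of the two rates at the closest threshold. Interpolating between thresholds is a common refinement.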

Abstract

Voice cloning is a prominent feature in personalized speech interfaces. A neural voice cloning system can mimic a person's voice from just a few audio samples. Two approaches are studied in the field: speaker adaptation and speaker encoding. Speaker adaptation fine-tunes a multi-speaker generative model on the new speaker's samples, whereas speaker encoding trains a separate model to infer a new speaker embedding directly. Both methods can achieve excellent naturalness and similarity to the original speaker, even with a small number of cloning audios. Speaker adaptation can offer slightly greater naturalness and similarity, but speaker encoding requires significantly less memory and has a faster cloning time, which makes it more appropriate for low-resource deployment. The main goal of this work is to create a voice cloning system that produces speech with a Nepali accent, that is, speech that sounds like Nepali. Transfer learning was used to address several issues encountered during the development of this system, including poor audio quality and the lack of available data.

Paper Structure (30 sections, 4 equations, 10 figures, 4 tables)

INTRODUCTION
Methodology
Proposed Method:
Nepali Speech Corpus Creation
Data-preprocessing
Encoder:
Synthesizer:
Architecture
Alignment Plots
Mel-Spectrogram
Vocoder:
Transfer Learning:
Result
Training Analysis
Encoder Training
...and 15 more sections

Figures (10)

Figure 1: Block Diagram Nepali Voice Cloning
Figure 2: Structure of speech Corpus
Figure 3: Preprocessing step for synthesizer model
Figure 4: Example content of train.txt
Figure 5: UMAP projection at different stages of training
...and 5 more figures
