VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech

Ashishkumar Gudmalwar, Nirmesh Shah, Sai Akarsh, Pankaj Wasnik, Rajiv Ratn Shah

TL;DR

VECL-TTS addresses cross-lingual TTS for automatic dubbing by jointly controlling voice identity and emotional style across languages. It extends YourTTS with separate multilingual embeddings for speakers and emotions, a stochastic duration predictor conditioned on these embeddings, and a wav2vec2-based content loss that stabilizes pronunciation after cross-lingual transfer. The model is trained with an explicit emotion consistency loss (ECL) and speaker consistency loss (SCL) to preserve the target emotion and speaker traits. Evaluations on English, Hindi, Telugu, and Marathi show improved emotion similarity and competitive naturalness; objective embedding-based metrics corroborate the subjective gains, with an 8.83% average relative improvement over SOTA. The approach offers a practical path toward multilingual dubbing with expressive, speaker-consistent cross-language synthesis and robust pronunciation.

Abstract

Despite the significant advancements in Text-to-Speech (TTS) systems, their full utilization in automatic dubbing remains limited. This task necessitates the extraction of voice identity and emotional style from a reference speech in a source language and subsequently transferring them to a target language using cross-lingual TTS techniques. While previous approaches have mainly concentrated on controlling voice identity within the cross-lingual TTS framework, there has been limited work on incorporating emotion and voice identity together. To this end, we introduce an end-to-end Voice Identity and Emotional Style Controllable Cross-Lingual (VECL) TTS system using multilingual speakers and an emotion embedding network. Moreover, we introduce content and style consistency losses to enhance the quality of synthesized speech further. The proposed system achieved an average relative improvement of 8.83% compared to the state-of-the-art (SOTA) methods on a database comprising English and three Indian languages (Hindi, Telugu, and Marathi).
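
The consistency losses mentioned above can be illustrated with a small sketch. It assumes the loss is the negative mean cosine similarity between embeddings of the reference speech and of the synthesized speech, which is how YourTTS formulates its speaker consistency loss; this page does not give the exact formulation used in VECL-TTS. The function names, the `weight` parameter, and the toy vectors are hypothetical, and a real system would take the embeddings from pretrained speaker or emotion encoders.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def consistency_loss(
    ref_embs: Sequence[Sequence[float]],
    gen_embs: Sequence[Sequence[float]],
    weight: float = 1.0,  # hypothetical scaling factor, not from the paper
) -> float:
    """Negative mean cosine similarity over a batch of embedding pairs.

    Minimizing this value pushes each generated-speech embedding toward
    the embedding of its reference utterance.
    """
    if len(ref_embs) != len(gen_embs) or len(ref_embs) == 0:
        raise ValueError("batches must be non-empty and the same size")
    sims = [cosine_similarity(r, g) for r, g in zip(ref_embs, gen_embs)]
    return -weight * sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy 2-D "embeddings"; real embeddings have hundreds of dimensions.
    reference = [[1.0, 0.0], [0.0, 2.0]]
    generated = [[0.9, 0.1], [0.2, 1.8]]
    print(consistency_loss(reference, generated))
```

The same function could serve for either the speaker or the emotion term by swapping the encoder that produces the embeddings; the loss reaches its minimum of `-weight` when every generated embedding points in the same direction as its reference.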

Paper Structure (15 sections, 3 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: Block diagram of the proposed VECL-TTS model. Contributions are highlighted with red dotted boxes.
  • Figure 2: Mel spectrograms and pitch contours for ground-truth speech and for speech generated by the proposed VECL-TTS and by YourTTS.