RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations

Ashwin Sankar, Yoach Lacombe, Sherry Thomas, Praveen Srinivasa Varadhan, Sanchit Gandhi, Mitesh M Khapra

TL;DR

RASMALAI is a large-scale, richly annotated corpus spanning 24 languages (23 Indian languages plus English) that enables text-description-guided TTS and addresses the data gap hindering expressive multilingual synthesis. By aggregating existing datasets and using LLMs to generate structured text descriptions from multifaceted attributes, RASMALAI provides the foundation for training IndicParlerTTS, the first open-source multilingual TTS system guided by text descriptions for Indian languages. Evaluations show high naturalness, strong instruction adherence, and robust expressive control, including effective zero-shot and cross-lingual transfer. The work delivers state-of-the-art controllable expressive synthesis for Indian languages together with an open resource to accelerate research and real-world deployment in multilingual TTS, with broad implications for accessibility and language technology in low-resource settings.

Abstract

We introduce RASMALAI, a large-scale speech dataset with rich text descriptions, designed to advance controllable and expressive text-to-speech (TTS) synthesis for 23 Indian languages and English. It comprises 13,000 hours of speech and 24 million text-description annotations with fine-grained attributes such as speaker identity, accent, emotion, style, and background conditions. Using RASMALAI, we develop IndicParlerTTS, the first open-source, text-description-guided TTS for Indian languages. Systematic evaluation demonstrates its ability to generate high-quality speech for named speakers, reliably follow text descriptions, and accurately synthesize specified attributes. Additionally, it effectively transfers expressive characteristics both within and across languages. IndicParlerTTS consistently achieves strong performance across these evaluations, setting a new standard for controllable multilingual expressive speech synthesis in Indian languages.

Paper Structure

This paper contains 13 sections, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: Comparison of durations for Rasmalai-pretrain, Rasmalai-finetune and IndicVoices-R across 24 languages.
  • Figure 2: Confusion plot for Perceptual Emotion Classification.