Enhancing Out-of-Vocabulary Performance of Indian TTS Systems for Practical Applications through Low-Effort Data Strategies

Srija Anand, Praveen Srinivasa Varadhan, Ashwin Sankar, Giri Raju, Mitesh M. Khapra

TL;DR

This work addresses the persistent out-of-vocabulary (OOV) problem in low-resource text-to-speech (TTS) for Hindi and Tamil by introducing IndicOOV, a benchmark of OOV words drawn from several real-world application categories. It demonstrates a cost-effective remedy: volunteers, rather than professional voice artists, record words containing character bigrams unseen in the training data, and state-of-the-art TTS models (FastPitch and VITS) are fine-tuned on these recordings. The results show substantial OOV intelligibility gains (up to ~41% relative reduction in intelligibility error rate) while preserving voice quality and in-domain performance. The approach offers a practical, low-cost way to improve OOV handling in deployed Indian-language TTS systems.

Abstract

Publicly available TTS datasets for low-resource languages like Hindi and Tamil typically contain 10-20 hours of data, leading to poor vocabulary coverage. This limitation becomes evident in downstream applications, where domain-specific vocabulary, coupled with frequent code-mixing with English, results in many out-of-vocabulary (OOV) words. To highlight this problem, we create a benchmark containing OOV words from several real-world applications. Intelligibility tests confirm that state-of-the-art Hindi and Tamil TTS systems perform poorly on this OOV benchmark. To improve OOV performance, we propose a low-effort and economically viable strategy for obtaining more training data: volunteers, rather than high-quality voice artists, record words containing character bigrams unseen in the training data. We show that training on such inexpensive data improves the model's performance on OOV words without affecting voice quality or in-domain performance.
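
The word-selection idea in the abstract (pick words containing character bigrams unseen in the training data) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, a "character" is assumed to be a single Unicode code point (the paper may segment Hindi and Tamil text differently), and the greedy rule that skips words whose new bigrams are already covered by an earlier pick is an added assumption.

```python
def char_bigrams(word):
    """Return the set of adjacent character pairs in a word.

    Assumption: a "character" is one Unicode code point.
    """
    return {word[i:i + 2] for i in range(len(word) - 1)}


def seen_bigrams(training_words):
    """Collect every character bigram that occurs in the training vocabulary."""
    seen = set()
    for word in training_words:
        seen |= char_bigrams(word)
    return seen


def select_words_with_unseen_bigrams(candidates, training_words):
    """Pick candidate words that contribute at least one unseen bigram.

    Greedy (an assumption, not from the source): once a bigram is covered
    by a selected word, later words are not picked for that bigram alone.
    """
    seen = seen_bigrams(training_words)
    selected = []
    for word in candidates:
        new = char_bigrams(word) - seen
        if new:
            selected.append(word)
            seen |= new
    return selected


if __name__ == "__main__":
    # Toy ASCII data for readability; real inputs would be Hindi or Tamil words.
    training = ["cat", "hat"]
    candidates = ["chat", "cha", "cat"]
    print(select_words_with_unseen_bigrams(candidates, training))
```

The selected words would then form the recording list handed to volunteers; how many words to record per unseen bigram is left open here.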

Paper Structure

This paper contains 18 sections, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: Intelligibility error rates (%) of Indian TTS models for Hindi (left) and Tamil (right) on the IndicOOV benchmark. Models consistently perform worse on OOV words than on in-vocabulary (IV) words.
  • Figure 2: Intelligibility error rates (%) of Indian TTS models for Hindi (left) and Tamil (right) on the IndicOOV benchmark, averaged across categories. Models consistently perform worse on OOV words than on IV words.