Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data

Srihari Bandarupalli, Bhavana Akkiraju, Charan Devarakonda, Vamsiraghusimha Narsinga, Anil Kumar Vuppala

TL;DR

Addresses ASR for morphologically complex, low-resource languages by leveraging cross-lingual unlabeled data. Develops a 300M-parameter model trained on a 3,000-hour multilingual unlabeled corpus with morphologically-aware SentencePiece tokenization. Demonstrates that targeted continual pretraining and data relevance can match or approach state-of-the-art performance achieved by much larger models on Persian, Arabic, and Urdu, despite fewer parameters and less labeled data. Provides a practical, scalable pathway toward inclusive ASR across underrepresented languages without reliance on massive proprietary data or compute.

Abstract

Automatic speech recognition for low-resource languages remains fundamentally constrained by the scarcity of labeled data and computational resources required by state-of-the-art models. We present a systematic investigation into cross-lingual continual pretraining for low-resource languages, using Perso-Arabic languages (Persian, Arabic, and Urdu) as our primary case study. Our approach demonstrates that strategic utilization of unlabeled speech data can effectively bridge the resource gap without sacrificing recognition accuracy. We construct a 3,000-hour multilingual corpus through a scalable unlabeled data collection pipeline and employ targeted continual pretraining combined with morphologically aware tokenization to develop a 300M-parameter model that achieves performance comparable to systems five times larger. Our model outperforms Whisper Large v3 (1.5B parameters) on Persian and achieves competitive results on Arabic and Urdu despite using significantly fewer parameters and substantially less labeled data. These findings challenge the prevailing assumption that ASR quality scales primarily with model size, revealing instead that data relevance and strategic pretraining are more critical factors for low-resource scenarios. This work provides a practical pathway toward inclusive speech technology, enabling effective ASR for underrepresented languages without dependence on massive computational infrastructure or proprietary datasets.

Paper Structure

This paper contains 20 sections, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Systematic pipeline for constructing a robust, multilingual unlabeled speech corpus.
  • Figure 2: Overview of our experimental framework. The pipeline illustrates three training strategies: CS (Wav2Vec 2.0 Base trained from scratch), CP1 (XLS-R 300M with continuous pretraining), and CP2 (Wav2Vec 2.0 Large with continuous pretraining). All models undergo pretraining on our 3,000-hour multilingual corpus followed by language-specific fine-tuning with SentencePiece tokenization.
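
The three strategies named in the Figure 2 caption can be laid out as a small configuration sketch. This is only an illustration of the structure the caption describes, not the authors' training code: the names `TrainingStrategy`, `STRATEGIES`, `PRETRAIN_HOURS`, and `stages` are hypothetical, and the stage descriptions are plain strings rather than calls into any real training library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingStrategy:
    """One of the strategies listed in the Figure 2 caption (hypothetical type)."""
    name: str        # label used in the caption: CS, CP1, or CP2
    base_model: str  # architecture or checkpoint named in the caption
    init: str        # "scratch" or "continued" (our own labels, an assumption)


# Corpus size stated in the caption and abstract.
PRETRAIN_HOURS = 3000

STRATEGIES = [
    TrainingStrategy("CS", "Wav2Vec 2.0 Base", "scratch"),
    TrainingStrategy("CP1", "XLS-R 300M", "continued"),
    TrainingStrategy("CP2", "Wav2Vec 2.0 Large", "continued"),
]


def stages(strategy: TrainingStrategy, language: str) -> list[str]:
    """Describe the two stages every strategy goes through, per the caption."""
    if strategy.init == "scratch":
        first = "pretrain from scratch"
    else:
        first = "continue pretraining"
    return [
        f"{first} {strategy.base_model} on the {PRETRAIN_HOURS}-hour multilingual corpus",
        f"fine-tune on labeled {language} speech with SentencePiece tokenization",
    ]


if __name__ == "__main__":
    for strategy in STRATEGIES:
        for language in ("Persian", "Arabic", "Urdu"):
            print(strategy.name, language, stages(strategy, language))
```

The only difference between the strategies in this sketch is the starting point of the first stage; the second stage (language-specific fine-tuning) is shared, which mirrors the caption's statement that all models follow the same pretrain-then-fine-tune path.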