Enhancing Indonesian Automatic Speech Recognition: Evaluating Multilingual Models with Diverse Speech Variabilities
Aulia Adila, Dessi Lestari, Ayu Purwarianti, Dipta Tanaya, Kurniawati Azizah, Sakriani Sakti
TL;DR
The paper tackles Indonesian ASR under diverse speech variabilities by constructing the IDSV dataset and fine-tuning two leading multilingual models, MMS and Whisper, on Indonesian data. The approach combines CTC-based fine-tuning for MMS with transformer-based fine-tuning for Whisper, and adds a KenLM-based language model for fair comparison. The fine-tuned Whisper model (FT-whisper) delivers the best overall performance across variability groups, while MMS benefits from LM integration but can underperform on skewed training distributions. Speaking style variability has the strongest impact on model accuracy. The work contributes a comprehensive Indonesian variability dataset and a cross-model evaluation framework, with practical implications for deploying ASR in real-world Indonesian settings that span formal and informal, read and spontaneous, and clean and noisy speech.
Abstract
An ideal speech recognition model can transcribe speech accurately across varying signal characteristics, such as speaking style (read and spontaneous), speech context (formal and informal), and background noise conditions (clean and moderate). Building such a model requires a significant amount of training data with diverse speech characteristics. Currently, Indonesian data is dominated by read, formal, and clean speech, leaving data with other speech variabilities scarce. To advance Indonesian automatic speech recognition (ASR), we study two state-of-the-art speech recognition models, Massively Multilingual Speech (MMS) and Whisper, and compile a dataset of Indonesian speech with diverse variabilities to facilitate our study. We further investigate the models' ability to transcribe Indonesian speech across different variability groups. The fine-tuned Whisper model achieved the best results across datasets with various characteristics, as indicated by the decrease in word error rate (WER) and character error rate (CER). Moreover, we found that speaking style variability affected model performance the most.
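For readers unfamiliar with the two metrics named above, the following is a minimal sketch of the standard definitions: WER and CER are the Levenshtein edit distance between the reference and the hypothesis, computed over words or characters respectively, divided by the reference length. This is not the authors' evaluation code; the function names (`edit_distance`, `wer`, `cer`) and the example sentence are illustrative assumptions, and no text normalization (casing, punctuation) is applied because the abstract does not describe one.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (minimum number of substitutions, insertions, and deletions)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length.
    Spaces are counted as characters in this sketch (an assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical example: the hypothesis drops the word "ke".
reference = "saya pergi ke pasar"
hypothesis = "saya pergi pasar"
print(wer(reference, hypothesis))  # 1 deletion over 4 words -> 0.25
print(cer(reference, hypothesis))  # 3 deletions over 19 characters (3/19, about 0.158)
```

Both metrics can exceed 1.0 when the hypothesis contains many insertions, since the denominator is the reference length only.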
