ASR Under Noise: Exploring Robustness for Sundanese and Javanese

Salsabila Zahirah Pranida, Muhammad Cendekia Airlangga, Rifo Ahmad Genadi, Shady Shehata

TL;DR

This work evaluates Whisper-based ASR for Javanese and Sundanese under noisy conditions, addressing robustness gaps for Indonesian regional languages. It systematically fine-tunes Whisper variants and compares clean, SpecAugment, and noise-aware training with synthetic AudioSet noises, demonstrating that noise-aware approaches yield substantial robustness gains, especially for larger models. Error analysis reveals language-specific patterns, with Sundanese more prone to vowel and diacritic-related errors and Javanese to consonant digraph issues. The study provides a reproducible pipeline and highlights directions for dialect-aware augmentation and speech enhancement to improve real-world performance.

Abstract

We investigate the robustness of Whisper-based automatic speech recognition (ASR) models for two major Indonesian regional languages: Javanese and Sundanese. While recent work has demonstrated strong ASR performance under clean conditions, the effectiveness of these models in noisy environments remains unclear. To address this, we experiment with multiple training strategies, including synthetic noise augmentation and SpecAugment, and evaluate performance across a range of signal-to-noise ratios (SNRs). Our results show that noise-aware training substantially improves robustness, particularly for larger Whisper models. A detailed error analysis further reveals language-specific challenges, highlighting avenues for future improvements.
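To make "synthetic noise augmentation" at a given SNR concrete, the following is a minimal sketch of mixing a noise clip into a speech waveform at a target SNR. It is an illustration under stated assumptions, not the authors' pipeline: the function name `mix_at_snr`, the choice to loop short noise clips, and the synthetic test signals are all hypothetical, and the paper's actual mixing procedure is not described in this section.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Return speech + scaled noise so that the mixture has the requested SNR (dB).

    Hypothetical helper for illustration; not taken from the paper.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Assumption: loop the noise clip if it is shorter than the speech, then crop.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0 or noise_power == 0:
        # Nothing sensible to scale against; return the speech unchanged.
        return speech.copy()

    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)), solved for gain.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise


# Usage with synthetic stand-ins for real audio (a sine "speech" and white noise).
rng = np.random.default_rng(0)
speech = np.sin(np.linspace(0, 100, 16000))
noise = rng.standard_normal(4000)
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```

Lower `snr_db` values produce a louder noise component relative to the speech, which matches the convention in the figures that higher SNR means cleaner audio.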

Paper Structure

This paper contains 32 sections, 2 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: (Left) Training and evaluation pipeline for Whisper-based ASR models. Each fine-tuned model is evaluated on clean and noisy versions of the OpenSLR test set. (Right) Examples of noisy transcriptions in Javanese and Sundanese using Whisper. The top boxes show spoken utterances with noise; the bottom boxes show the corresponding ASR outputs, demonstrating significantly degraded quality under noisy conditions.
  • Figure 2: WER performance of Large-v3 Whisper across different SNR levels for Javanese and Sundanese. Models trained with NoiseTrain consistently outperform others under low-SNR conditions. Higher SNR values indicate cleaner audio.
  • Figure 3: WER performance of Whisper variants across different SNR levels for Javanese and Sundanese: (a) Tiny - Javanese, (b) Medium - Javanese, (c) Large-v3-Turbo - Javanese, (d) Tiny - Sundanese, (e) Medium - Sundanese, (f) Large-v3-Turbo - Sundanese.
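Figures 2 and 3 report word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch using the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` is hypothetical, whitespace tokenization is an assumption, and any text normalization the authors apply before scoring is not covered here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is plain whitespace splitting.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens rather than real transcripts: one substitution and one
# deletion against a four-word reference.
score = word_error_rate("a b c d", "a x c")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference, which is worth keeping in mind when reading low-SNR results.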