Accent-Invariant Automatic Speech Recognition via Saliency-Driven Spectrogram Masking
Mohammad Hossein Sameti, Sepehr Harfi Moridani, Ali Zarean, Hossein Sameti
TL;DR
The paper tackles the challenge of accent and dialect variability in ASR by introducing an accent-invariant framework that integrates accent classification with Grad-CAM-based spectrogram masking to suppress accent-specific cues. The approach generates masked spectrograms for data augmentation and fine-tunes a transformer-based ASR model, improving robustness without changing the model architecture. It also introduces the PDID dataset, a systematic Persian accent benchmark covering 10 regional dialects. Experimental results on English and Persian show significant WER/CER reductions, especially for unseen accents, highlighting the method's potential for multilingual and low-resource ASR deployment.
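For readers unfamiliar with the metrics, WER and CER are conventionally defined as the edit distance between the reference and the recognized text, divided by the reference length, counted in words and characters respectively. The sketch below illustrates that standard definition only; the function names are hypothetical, and any text normalization applied in the paper's evaluation is not specified here and is not reproduced.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# One substituted word out of six reference words.
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, which is why they are reported as error rates rather than accuracies.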
Abstract
Pre-trained transformer-based models have significantly advanced automatic speech recognition (ASR), yet they remain sensitive to accent and dialectal variation, which raises word error rates (WER) in languages with wide accent and dialect diversity such as English and Persian. To address this challenge, we propose an accent-invariant ASR framework that integrates accent and dialect classification into the recognition pipeline. Our approach trains a spectrogram-based classifier to capture accent-specific cues, masks the regions most influential to its predictions, and uses the masked spectrograms for data augmentation, which makes ASR models more robust to accent variability. We evaluate the method on both English and Persian speech. For Persian, we introduce a newly collected dataset spanning multiple regional accents, establishing the first systematic benchmark for accent variation in Persian ASR. The dataset fills a critical gap in multilingual speech research and provides a foundation for future studies on low-resource languages with substantial dialectal diversity. Experimental results with the Whisper model demonstrate that our masking and augmentation strategy yields substantial WER reductions in both English and Persian settings, confirming the effectiveness of the approach. This research advances the development of multilingual ASR systems that are resilient to accent and dialect diversity. Code and dataset are publicly available at: https://github.com/MH-Sameti/Accent_invariant_ASR
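The masking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a saliency map (for example, a Grad-CAM map from the accent classifier, upsampled to the spectrogram's resolution) has already been computed, and the function name `mask_salient_regions`, the `top_fraction` hyperparameter, and the mean-fill strategy are hypothetical choices made for the example.

```python
import numpy as np


def mask_salient_regions(spec, saliency, top_fraction=0.1, fill_value=None):
    """Return a masked copy of `spec` and the boolean mask that was applied.

    spec:         2-D array (frequency bins x time frames), e.g. log-mel.
    saliency:     2-D array of the same shape; higher means more influential
                  for the accent classifier (assumed computed elsewhere).
    top_fraction: fraction of bins to mask (hypothetical hyperparameter).
    fill_value:   value written into masked bins; defaults to the mean of spec.
    """
    if spec.shape != saliency.shape:
        raise ValueError("spec and saliency must have the same shape")
    if not 0.0 < top_fraction < 1.0:
        raise ValueError("top_fraction must be in (0, 1)")

    # Number of bins to mask; always at least one.
    k = max(1, int(round(top_fraction * spec.size)))

    # Flat indices of the k highest-saliency bins.
    top_idx = np.argpartition(saliency.ravel(), -k)[-k:]
    mask = np.zeros(spec.size, dtype=bool)
    mask[top_idx] = True
    mask = mask.reshape(spec.shape)

    if fill_value is None:
        fill_value = float(spec.mean())

    masked = spec.copy()
    masked[mask] = fill_value  # suppress the accent-salient regions
    return masked, mask


# Usage with synthetic data standing in for a real spectrogram and saliency map.
rng = np.random.default_rng(0)
demo_spec = rng.normal(size=(80, 300))
demo_saliency = rng.random(size=(80, 300))
demo_masked, demo_mask = mask_salient_regions(demo_spec, demo_saliency)
print(demo_mask.mean())  # fraction of bins that were masked
```

In an augmentation setup of this kind, the masked spectrograms would be paired with the original transcripts and added to the fine-tuning data, so that the ASR model sees examples in which the classifier's most accent-informative regions are suppressed.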
