Context-Aware Whisper for Arabic ASR Under Linguistic Varieties
Bashar Talafha, Amin Abu Alhassan, Muhammad Abdul-Mageed
TL;DR
This work tackles dialectal Arabic ASR without retraining by introducing context-aware decoding for Whisper. It develops two complementary approaches, prompt-based context integration in the decoder and prefix-based context integration in the encoder, and supplies the context from first-pass transcriptions, retrieved text, or voice-cloned speech. Evaluated on nine Arabic linguistic conditions spanning MSA and dialectal speech, the method reduces WER by up to 22.3% on MSA and 9.2% on dialectal speech, also lowers CER, and reduces hallucinations while improving robustness to speaker variation. The study also discusses practical deployment considerations, including computational overhead and prompt-length constraints, and points to future work on prompting strategies and code-switching support.
Abstract
Low-resource ASR remains a challenging problem, especially for languages like Arabic that exhibit wide dialectal variation and limited labeled data. We propose context-aware prompting strategies to adapt OpenAI's Whisper for Arabic speech recognition without retraining. Our methods include decoder prompting with first-pass transcriptions or retrieved utterances, and encoder prefixing using speech synthesized in the target speaker's voice. We introduce techniques such as prompt reordering, speaker-aware prefix synthesis, and modality-specific retrieval (lexical, semantic, acoustic) to improve transcription in real-world, zero-shot settings. Evaluated on nine Arabic linguistic conditions, our approach reduces WER by up to 22.3% on Modern Standard Arabic and 9.2% on dialectal speech, significantly mitigating hallucinations and speaker mismatch.
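The decoder-prompting idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the word-level Jaccard retrieval, the "most similar utterance last" ordering, and the word budget used as a stand-in for the prompt-length constraint are all assumptions made for illustration. The recognizer is passed in as a callable so the logic can be checked without loading a model.

```python
from typing import Callable, Optional, Sequence

# Hypothetical interface: (audio_path, prompt_or_None) -> transcript text.
TranscribeFn = Callable[[str, Optional[str]], str]


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard overlap; a simple stand-in for lexical retrieval."""
    sa, sb = set(a.split()), set(b.split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def build_prompt(first_pass: str, pool: Sequence[str],
                 k: int = 2, max_words: int = 50) -> str:
    """Pick the k pool utterances most similar to the first-pass text."""
    ranked = sorted(pool, key=lambda u: jaccard(first_pass, u), reverse=True)[:k]
    # Assumed ordering: most similar utterance last, closest to the
    # position where decoding starts.
    ordered = list(reversed(ranked))
    words = " ".join(ordered).split()
    # Crude word budget (assumption); keep the tail of the prompt.
    return " ".join(words[-max_words:])


def two_pass_transcribe(transcribe: TranscribeFn, audio_path: str,
                        pool: Sequence[str] = (), k: int = 2,
                        max_words: int = 50) -> str:
    """First pass without context, second pass conditioned on a prompt."""
    first_pass = transcribe(audio_path, None)
    # With no retrieval pool, fall back to the first-pass text as the prompt.
    prompt = build_prompt(first_pass, pool, k, max_words) if pool else first_pass
    return transcribe(audio_path, prompt or None)


# Example wiring with the openai-whisper package (not executed here):
#   import whisper
#   model = whisper.load_model("small")
#   fn = lambda path, prompt: model.transcribe(
#       path, language="ar", initial_prompt=prompt)["text"]
#   text = two_pass_transcribe(fn, "clip.wav", pool=["...", "..."])
```

Injecting the recognizer as a function keeps the retrieval and ordering logic independent of any particular Whisper checkpoint, and the same skeleton could accept a semantic or acoustic similarity function in place of `jaccard`.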
