Context-Aware Whisper for Arabic ASR Under Linguistic Varieties

Bashar Talafha, Amin Abu Alhassan, Muhammad Abdul-Mageed

TL;DR

This work tackles dialectal Arabic ASR without retraining by introducing context-aware decoding for Whisper. It develops two complementary approaches, prompt-based and prefix-based context integration, which use first-pass transcriptions, retrieved text, and voice-cloned speech as context to guide transcription. Across multiple datasets covering nine Arabic linguistic conditions (MSA and dialectal speech), the method reduces WER and CER, with WER falling by up to 22.3% on MSA and 9.2% on dialectal speech, and it produces fewer hallucinations and is less sensitive to speaker mismatch. The study also discusses practical deployment considerations, including computational overhead and prompt-length constraints, and points to future work on prompting strategies and code-switching support.

Abstract

Low-resource ASR remains a challenging problem, especially for languages like Arabic that exhibit wide dialectal variation and limited labeled data. We propose context-aware prompting strategies to adapt OpenAI's Whisper for Arabic speech recognition without retraining. Our methods include decoder prompting with first-pass transcriptions or retrieved utterances, and encoder prefixing using speech synthesized in the target speaker's voice. We introduce techniques such as prompt reordering, speaker-aware prefix synthesis, and modality-specific retrieval (lexical, semantic, acoustic) to improve transcription in real-world, zero-shot settings. Evaluated on nine Arabic linguistic conditions, our approach reduces WER by up to 22.3% on Modern Standard Arabic and 9.2% on dialectal speech, significantly mitigating hallucinations and speaker mismatch.
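
As an illustration of decoder prompting with a first-pass transcription, the following is a minimal sketch of one plausible two-pass loop; it is not the authors' implementation. The function name `two_pass_transcribe` and the `Transcriber` protocol are hypothetical. The sketch assumes a model object whose `transcribe` method accepts `language` and `initial_prompt` keyword arguments and returns a dictionary with a `"text"` field, as the open-source `openai-whisper` package does.

```python
from typing import Any, Dict, Protocol


class Transcriber(Protocol):
    """Hypothetical interface: any object with a Whisper-style transcribe()."""

    def transcribe(self, audio: str, **options: Any) -> Dict[str, Any]:
        ...


def two_pass_transcribe(model: Transcriber, audio_path: str, language: str = "ar") -> str:
    """Decode once without context, then again with the draft as a text prompt."""
    # Pass 1: plain decoding, no prompt.
    first = model.transcribe(audio_path, language=language)
    draft = first["text"].strip()
    if not draft:
        # Nothing to condition on; return the empty first-pass result.
        return draft

    # Pass 2: feed the draft back as decoder context (assumed to map to
    # Whisper's "previous text" slot via the initial_prompt option).
    second = model.transcribe(audio_path, language=language, initial_prompt=draft)
    return second["text"].strip()
```

With `openai-whisper` installed, a model returned by `whisper.load_model(...)` should satisfy this interface. The second pass roughly doubles decoding cost, which is one source of the computational overhead that context-aware decoding introduces.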

Paper Structure

This paper contains 38 sections, 5 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Context-aware adaptation strategies: (A) Prompt-based, (B) Prefix-based. We experiment with multiple feature extraction methods and compare each method's performance (see Section \ref{similar-as-prompt}). The decoder inputs follow Whisper's multitask training format and include: Prev: previous text tokens, SOT: start of transcript, AR: language tag set to Arabic, and TRAN: transcription mode tag. These tokens configure Whisper’s decoding behavior and enable contextual prompting.
  • Figure 2: Comparison between the performance of Whisper on real speech vs. TTS-generated speech across different language settings (sample size=1000).
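
The decoder input layout described in the Figure 1 caption (Prev, SOT, AR, TRAN) can be sketched as a simple sequence builder. This is an illustrative sketch, not the paper's code. The helper `build_decoder_input` is hypothetical, context tokens are represented as plain strings rather than real BPE token ids, and the optional `<|notimestamps|>` token is omitted. The special-token strings are those of Whisper's multilingual tokenizer. The 223-token prompt cap is an assumption taken from the reference implementation, which keeps at most `n_text_ctx // 2 - 1` prompt tokens with `n_text_ctx = 448`.

```python
from typing import List

# Special-token strings from Whisper's multilingual tokenizer.
SOT_PREV = "<|startofprev|>"    # "Prev": marks the start of previous-text context
SOT = "<|startoftranscript|>"   # "SOT"
LANG_AR = "<|ar|>"              # "AR": Arabic language tag
TRANSCRIBE = "<|transcribe|>"   # "TRAN": transcription (not translation) task

# Assumption: prompt cap of 448 // 2 - 1 tokens, as in the reference decoder.
MAX_PROMPT_TOKENS = 223


def build_decoder_input(
    context_tokens: List[str],
    max_prompt_tokens: int = MAX_PROMPT_TOKENS,
) -> List[str]:
    """Return the decoder input as a list of token strings (illustrative only)."""
    task_tokens = [SOT, LANG_AR, TRANSCRIBE]
    if not context_tokens or max_prompt_tokens <= 0:
        # No context: the decoder sees only the task-configuration tokens.
        return task_tokens
    # Keep the most recent tokens when the context exceeds the cap.
    kept = context_tokens[-max_prompt_tokens:]
    return [SOT_PREV, *kept, *task_tokens]


if __name__ == "__main__":
    # Whitespace-split words stand in for real tokens in this toy example.
    print(build_decoder_input("مرحبا بكم في البرنامج".split()))
```

Truncating from the left keeps the context nearest to the audio being decoded, which is why long first-pass or retrieved prompts lose their earliest tokens first under a prompt-length limit.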