Automatic Restoration of Diacritics for Speech Data Sets
Sara Shatnawi, Sawsan Alqahtani, Hanan Aldarmaki
TL;DR
This paper tackles the poor generalization of text-based Arabic diacritization models when they are applied to speech data. It introduces a Text+ASR framework that combines the undiacritized text with provisional diacritics produced by a Whisper ASR model fine-tuned on diacritized speech, using cross-attention to integrate the two inputs. The authors report reductions in diacritic error rate (DER) of up to 45% relative to strong text-only baselines, particularly for Classical Arabic speech. LSTM variants generally outperform Transformer variants in this setup. The approach needs only a relatively small amount of diacritized speech, and its output can help build larger diacritized corpora for ASR and TTS applications. Generalization to other Arabic variants remains challenging because of data scarcity and domain shift.
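To make the DER metric and the "relative improvement" figure concrete, the following is a minimal sketch, not the authors' evaluation script. It assumes DER is the fraction of non-space characters whose diacritic marks differ from the reference, and that the reference and hypothesis share the same base letters; published DER variants differ on details such as whether undiacritized characters or case endings are counted. The function names are hypothetical.

```python
# Arabic tashkeel marks U+064B..U+0652 (tanween, short vowels, shadda, sukun).
DIACRITICS = frozenset(chr(c) for c in range(0x064B, 0x0653))


def split_diacritics(text):
    """Return a list of (base_char, diacritic_marks) pairs for a string."""
    pairs = []
    for ch in text:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding base character
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def diacritic_error_rate(reference, hypothesis):
    """Assumed definition: share of non-space characters with wrong marks."""
    ref = split_diacritics(reference)
    hyp = split_diacritics(hypothesis)
    if [b for b, _ in ref] != [b for b, _ in hyp]:
        raise ValueError("reference and hypothesis base characters differ")
    # Compare marks as sets so that shadda+vowel ordering does not matter.
    scored = [(r, h) for (b, r), (_, h) in zip(ref, hyp) if not b.isspace()]
    if not scored:
        return 0.0
    errors = sum(set(r) != set(h) for r, h in scored)
    return errors / len(scored)


def relative_improvement(baseline_der, new_der):
    """Relative DER reduction of a new system over a baseline."""
    return (baseline_der - new_der) / baseline_der
```

Under this definition, a 45% relative improvement means the new system's DER is 55% of the baseline's DER, whatever the absolute values are.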
Abstract
Automatic text-based diacritic restoration models generally have high diacritic error rates when applied to speech transcripts as a result of domain and style shifts in spoken language. In this work, we explore the possibility of improving the performance of automatic diacritic restoration when applied to speech data by utilizing parallel spoken utterances. In particular, we use the pre-trained Whisper ASR model fine-tuned on relatively small amounts of diacritized Arabic speech data to produce rough diacritized transcripts for the speech utterances, which we then use as an additional input for diacritic restoration models. The proposed framework consistently improves diacritic restoration performance compared to text-only baselines. Our results highlight the inadequacy of current text-based diacritic restoration models for speech data sets and provide a new baseline for speech-based diacritic restoration.
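The two-input setup described in the abstract can be illustrated with a small preprocessing sketch. This is an assumption-laden illustration rather than the paper's pipeline: the helper `build_model_inputs` and its output keys are hypothetical, and the actual tokenization and encoding are not specified in this section. It shows one plausible way to pair the undiacritized text with the rough diacritized ASR transcript, and to flag cases where the ASR letters deviate from the text, which is one reason the two sequences cannot simply be aligned position by position.

```python
# Arabic tashkeel marks U+064B..U+0652 (tanween, short vowels, shadda, sukun).
ARABIC_DIACRITICS = frozenset(chr(c) for c in range(0x064B, 0x0653))


def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritic marks, keeping letters, spaces, punctuation."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


def build_model_inputs(plain_text: str, asr_transcript: str) -> dict:
    """Pair the text to diacritize with a rough diacritized ASR transcript.

    Hypothetical structure: two character sequences, one per input stream,
    plus a flag telling whether the ASR letters agree with the text.
    """
    text_chars = list(strip_diacritics(plain_text))
    asr_chars = list(asr_transcript)  # diacritics kept as separate symbols
    letters_match = strip_diacritics(asr_transcript) == "".join(text_chars)
    return {
        "text_chars": text_chars,
        "asr_chars": asr_chars,
        "asr_letters_match": letters_match,
    }
```

When `asr_letters_match` is false, the ASR transcript contains recognition errors in the letters themselves, so a downstream model has to treat it as a noisy hint rather than a ground-truth alignment.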
