Automatic Restoration of Diacritics for Speech Data Sets

Sara Shatnawi; Sawsan Alqahtani; Hanan Aldarmaki

TL;DR

This paper tackles the poor generalization of text-based Arabic diacritization when applied to speech data. It introduces a Text+ASR framework that fuses undiacritized text with provisional diacritics generated by a Whisper ASR model fine-tuned on diacritized speech, using cross-attention to integrate the two modalities. The authors report significant reductions in diacritic error rate (DER), up to a 45% relative improvement over strong text-only baselines, particularly for Classical Arabic speech. LSTM variants generally outperform Transformer variants in this setup. The approach requires only relatively small diacritized speech datasets, yet it enables robust diacritization that could help build larger diacritized corpora for ASR and TTS applications. Generalization to other Arabic varieties remains challenging due to data scarcity and domain shift.
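To make the cross-attention fusion more concrete, here is a minimal sketch of scaled dot-product cross-attention in plain Python. It is not the authors' implementation: the names `text_states` and `asr_states` are hypothetical, the vectors stand in for encoder outputs, and the learned query/key/value projections a real model would use are omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_states, asr_states):
    """Attend from every text position to every ASR-hypothesis position.

    text_states: query vectors, one per undiacritized input character (assumed).
    asr_states:  vectors for the ASR hypothesis, used as both keys and values
                 (assumed; no learned projections in this sketch).
    Returns (context, weights): one context vector and one weight row per query.
    """
    dim = len(asr_states[0])
    context, weights = [], []
    for query in text_states:
        # Scaled dot-product score between this query and each key.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in asr_states
        ]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of the value vectors.
        context.append(
            [sum(w * value[i] for w, value in zip(row, asr_states)) for i in range(dim)]
        )
    return context, weights

# Toy example: 2 text positions attending over 3 ASR positions.
text_states = [[1.0, 0.0], [0.0, 1.0]]
asr_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = cross_attention(text_states, asr_states)
```

Each row of `weights` sums to 1 and is the kind of quantity a heatmap of cross-attention weights would display. How the context vectors are combined with the text representation is a separate design choice that this sketch leaves open.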

Abstract

Automatic text-based diacritic restoration models generally have high diacritic error rates when applied to speech transcripts as a result of domain and style shifts in spoken language. In this work, we explore the possibility of improving the performance of automatic diacritic restoration when applied to speech data by utilizing parallel spoken utterances. In particular, we use the pre-trained Whisper ASR model fine-tuned on relatively small amounts of diacritized Arabic speech data to produce rough diacritized transcripts for the speech utterances, which we then use as an additional input for diacritic restoration models. The proposed framework consistently improves diacritic restoration performance compared to text-only baselines. Our results highlight the inadequacy of current text-based diacritic restoration models for speech data sets and provide a new baseline for speech-based diacritic restoration.
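As an illustration of the metric behind these claims, the following is a minimal sketch of a diacritic error rate computation. It rests on assumptions that may differ from the paper's exact protocol: every non-whitespace character is scored (including characters that carry no diacritic), the diacritic set is limited to U+064B through U+0652, and the function names are hypothetical. The DER values in the usage lines are made-up numbers chosen only to show the arithmetic of a relative improvement.

```python
# Arabic diacritic marks (tashkeel), U+064B (fathatan) through U+0652 (sukun).
DIACRITICS = frozenset(chr(code) for code in range(0x064B, 0x0653))

def split_diacritics(text):
    """Split text into base characters and the diacritic label of each one."""
    bases, labels = [], []
    for ch in text:
        if ch in DIACRITICS:
            if labels:  # a diacritic with no preceding base character is ignored
                labels[-1].append(ch)
        else:
            bases.append(ch)
            labels.append([])
    # Sort the marks so that e.g. shadda+fatha and fatha+shadda compare equal.
    return bases, ["".join(sorted(marks)) for marks in labels]

def diacritic_error_rate(reference, hypothesis):
    """Fraction of scored characters whose diacritic label is wrong."""
    ref_bases, ref_labels = split_diacritics(reference)
    hyp_bases, hyp_labels = split_diacritics(hypothesis)
    if ref_bases != hyp_bases:
        raise ValueError("reference and hypothesis must share the same base text")
    # Assumption: score all non-whitespace characters.
    scored = [
        (ref, hyp)
        for base, ref, hyp in zip(ref_bases, ref_labels, hyp_labels)
        if not base.isspace()
    ]
    if not scored:
        return 0.0
    return sum(ref != hyp for ref, hyp in scored) / len(scored)

def relative_improvement(baseline_der, new_der):
    """Relative DER reduction of a new system over a baseline."""
    return (baseline_der - new_der) / baseline_der

# Three letters, with the middle diacritic predicted as kasra instead of fatha.
reference = "\u0643\u064E\u062A\u064E\u0628\u064E"
hypothesis = "\u0643\u064E\u062A\u0650\u0628\u064E"
der = diacritic_error_rate(reference, hypothesis)

# Hypothetical DER values, not results from the paper.
gain = relative_improvement(0.10, 0.055)
```

Requiring identical base characters reflects the setting described here, where the undiacritized transcript is given and only the diacritics are predicted.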

Paper Structure (26 sections, 3 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Text-based diacritic restoration
Speech-based diacritic restoration
Proposed Framework
Notation:
Sequence Encoder Architecture
Transformer Model:
bi-LSTM Model:
Sliding Window Inference
Experimental Settings
Datasets
Model Setup
ASR Model
Baselines
...and 11 more sections

Figures (3)

Figure 1: The proposed diacritic restoration model takes speech utterances and their undiacritized transcripts as input, and produces diacritized text. Left: text-only diacritic restoration, which can be any sequence labeling model. Full figure: Proposed framework, which includes a speech recognition model pre-trained to produce diacritized hypotheses, and a cross-attention mechanism to fuse the two modalities.
Figure 2: Top: Basic transformer self attention and prediction regions with sliding window mechanism for inference. Bottom: Cross-attention region from ASR prediction used in each sub-sequence.
Figure 3: Cross-attention weights between the undiacritized input (black) and ASR text (magenta).
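The sliding window idea named in the Figure 2 caption can be sketched roughly as follows. This is an illustrative reading, not the paper's algorithm: the `window` and `margin` values, the rule of keeping only the central region of each window, and the `predict_window` callback are all assumptions introduced for the example.

```python
def sliding_window_predict(chars, predict_window, window=8, margin=2):
    """Label a long sequence with a model that only sees `window` items at a time.

    predict_window: hypothetical callable mapping a sub-sequence to one label
                    per item. Predictions within `margin` items of a window
                    edge are discarded, except at the true sequence ends.
    """
    if window <= 2 * margin:
        raise ValueError("window must be larger than twice the margin")
    total = len(chars)
    stride = window - 2 * margin  # consecutive kept regions tile the sequence
    labels = [None] * total
    start = 0
    while True:
        end = min(start + window, total)
        preds = predict_window(chars[start:end])
        # Keep the central region; keep the edge too at either sequence end.
        keep_from = start if start == 0 else start + margin
        keep_to = end if end == total else end - margin
        labels[keep_from:keep_to] = preds[keep_from - start:keep_to - start]
        if end == total:
            break
        start += stride
    return labels

# Stand-in "model" that uppercases each character and records window sizes.
seen_sizes = []

def fake_model(sub_sequence):
    seen_sizes.append(len(sub_sequence))
    return [ch.upper() for ch in sub_sequence]

result = sliding_window_predict(list("abcdefghijklmnopqrst"), fake_model)
```

With these settings each position receives exactly one kept prediction, and the model is never called on more than `window` items. Choosing the margin trades extra computation for more context around each predicted position.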
