
Arabic Text Diacritization In The Age Of Transfer Learning: Token Classification Is All You Need

Abderrahman Skiredj, Ismail Berrada

TL;DR

This work tackles Arabic Text Diacritization (ATD) by introducing PTCAD, a two-phase framework that reframes ATD as a token classification problem. Phase 1 pre-finetunes a BERT-like model on linguistically relevant tasks (CA text, POS tagging, segmentation, and ATD as MLM) to build rich contextual representations, while Phase 2 treats ATD as token classification with a mask-based input transformation, akin to NER. Evaluations on the Abbad Tashkeela and Fadel Tashkeela benchmarks show state-of-the-art performance, with about a 20% relative reduction in WER compared to the prior state of the art and better results than GPT-4 on ATD tasks. Ablation studies confirm the critical role of multi-task pre-finetuning, and error analyses point to key bottlenecks (limited context, truncation, and linguistic complexity) that can guide future work. Overall, PTCAD shows how task-aligned pre-finetuning and a token-classification framing can substantially improve Arabic diacritization, with practical benefits for downstream NLP applications.

Abstract

Automatic diacritization of Arabic text involves adding diacritical marks (diacritics) to the text. This task poses a significant challenge with noteworthy implications for computational processing and comprehension. In this paper, we introduce PTCAD (Pre-FineTuned Token Classification for Arabic Diacritization), a novel two-phase approach for the Arabic Text Diacritization task. PTCAD comprises a pre-finetuning phase and a finetuning phase, treating Arabic Text Diacritization as a token classification task for pre-trained models. The effectiveness of PTCAD is demonstrated through evaluations on two benchmark datasets derived from the Tashkeela dataset, where it achieves state-of-the-art results, including a 20% reduction in Word Error Rate (WER) compared to existing benchmarks and superior performance over GPT-4 in ATD tasks.
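
To make the classification framing concrete, the sketch below shows one common way to turn diacritized text into an input sequence plus one label per position. It is only an illustration under stated assumptions: it works at the character level, the function names `to_token_classification` and `apply_labels` are hypothetical, and it does not reproduce PTCAD's mask-based input transformation, which is not specified in this summary.

```python
# Arabic diacritic code points U+064B..U+0652
# (tanween forms, short vowels, shadda, sukun).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def to_token_classification(diacritized):
    """Split diacritized text into base characters and one label per character.

    Assumption for this sketch: the label of a character is the (possibly
    empty) run of diacritics that follows it, so a shadda + vowel pair
    forms a single class.
    """
    chars, labels = [], []
    for ch in diacritized:
        if ch in DIACRITICS:
            if chars:  # attach the mark to the preceding base character
                labels[-1] += ch
            # a leading diacritic with no base character is dropped
        else:
            chars.append(ch)
            labels.append("")
    return chars, labels


def apply_labels(chars, labels):
    """Rebuild diacritized text from base characters and predicted labels."""
    return "".join(c + l for c, l in zip(chars, labels))


# "kataba": kaf, teh, beh, each followed by a fatha (U+064E).
text = "\u0643\u064E\u062A\u064E\u0628\u064E"
chars, labels = to_token_classification(text)
restored = apply_labels(chars, labels)
```

Under this framing a model only has to predict `labels` for the undiacritized `chars`, which has the same shape as a sequence-labeling task such as NER.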

Paper Structure

This paper contains 20 sections, 10 equations, 3 figures, 11 tables.

Figures (3)

  • Figure 1: Global overview of PTCAD approach
  • Figure 2: Sample input transformation for ATD
  • Figure 3: Histogram of DER and WER for the Abbad Tashkeela and Fadel Tashkeela benchmarks
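
For readers unfamiliar with the two metrics named in Figure 3, the sketch below computes a simple version of Diacritic Error Rate (DER) and Word Error Rate (WER). It is an assumption-laden illustration, not the paper's evaluation script: it counts every base character (including those with no diacritic), treats a word as wrong if any of its characters has a wrong label, and applies none of the special handling (for example, of case endings) that benchmark protocols may use. The names `split_labels` and `der_wer` are hypothetical.

```python
# Arabic diacritic code points U+064B..U+0652.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def split_labels(word):
    """Return (base characters, per-character diacritic labels) for one word."""
    chars, labels = [], []
    for ch in word:
        if ch in DIACRITICS:
            if chars:
                labels[-1] += ch
        else:
            chars.append(ch)
            labels.append("")
    return chars, labels


def der_wer(reference, prediction):
    """Compute (DER, WER) for two texts that share the same base characters."""
    ref_words, pred_words = reference.split(), prediction.split()
    if not ref_words or len(ref_words) != len(pred_words):
        raise ValueError("texts must be non-empty with the same number of words")
    char_total = char_errors = word_errors = 0
    for ref_word, pred_word in zip(ref_words, pred_words):
        ref_chars, ref_labels = split_labels(ref_word)
        pred_chars, pred_labels = split_labels(pred_word)
        if ref_chars != pred_chars:
            raise ValueError("texts must share the same base characters")
        errors = sum(r != p for r, p in zip(ref_labels, pred_labels))
        char_total += len(ref_labels)
        char_errors += errors
        word_errors += errors > 0  # a word is wrong if any label is wrong
    return char_errors / char_total, word_errors / len(ref_words)


# Two words, six base characters; the prediction gets one diacritic wrong.
reference = "\u0643\u064E\u062A\u064E\u0628\u064E \u0639\u0650\u0644\u0652\u0645"
prediction = "\u0643\u064E\u062A\u064F\u0628\u064E \u0639\u0650\u0644\u0652\u0645"
der, wer = der_wer(reference, prediction)
```

In this toy case one of six characters and one of two words are wrong, which illustrates why WER is typically much higher than DER on the same output.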