Arabic Text Diacritization In The Age Of Transfer Learning: Token Classification Is All You Need
Abderrahman Skiredj, Ismail Berrada
TL;DR
This work tackles Arabic Text Diacritization (ATD) by introducing PTCAD, a two-phase framework that reframes ATD as a token classification problem. Phase 1 pre-finetunes a BERT-like model on linguistically relevant tasks (CA text, POS tagging, segmentation, and ATD as MLM) to build rich contextual representations. Phase 2 then treats ATD as token classification, akin to NER, using a mask-based input transformation. Evaluations on the Abbad and Fadel Tashkeela datasets show state-of-the-art performance: about a 20% relative reduction in WER compared to the prior SOTA, and better results than GPT-4 on ATD. Ablation studies confirm the critical role of multi-task pre-finetuning, and error analyses identify key bottlenecks such as limited context, truncation, and linguistic complexity, which guide future enhancements. Overall, PTCAD demonstrates how task-aligned pre-training and a token-classification framing can substantially improve Arabic diacritization, offering practical benefits for downstream NLP applications.
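To make the token-classification framing concrete, the sketch below turns a diacritized string into a sequence of base characters plus one label per character, in the same spirit as NER tags. It is only an illustration under stated assumptions: the character-level granularity, the label format (the raw diacritic string attached to each character), and the helper names `to_token_labels` and `apply_labels` are hypothetical and are not taken from the paper, whose mask-based input transformation is not detailed in this section.

```python
# Arabic diacritic code points U+064B..U+0652
# (tanween forms, short vowels, shadda, sukun).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def to_token_labels(diacritized):
    """Split diacritized text into base characters and per-character labels.

    Assumption: each label is the (possibly empty) string of diacritics
    that follows a base character. A diacritic with no preceding base
    character is dropped.
    """
    chars, labels = [], []
    for ch in diacritized:
        if ch in DIACRITICS:
            if chars:
                labels[-1] += ch  # attach to the previous base character
        else:
            chars.append(ch)
            labels.append("")  # empty label means "no diacritic"
    return chars, labels


def apply_labels(chars, labels):
    """Rebuild diacritized text from base characters and predicted labels."""
    return "".join(c + l for c, l in zip(chars, labels))


# "kataba": kaf, ta, ba, each followed by a fatha (U+064E).
example = "\u0643\u064E\u062A\u064E\u0628\u064E"
chars, labels = to_token_labels(example)
restored = apply_labels(chars, labels)
```

Under this framing, a model only has to predict `labels` given `chars`, and `apply_labels` recovers the diacritized text from its predictions.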
Abstract
Automatic diacritization of Arabic text involves adding diacritical marks (diacritics) to the text. This task poses a significant challenge with noteworthy implications for computational processing and comprehension. In this paper, we introduce PTCAD (Pre-FineTuned Token Classification for Arabic Diacritization), a novel two-phase approach for the Arabic Text Diacritization (ATD) task. PTCAD comprises a pre-finetuning phase and a finetuning phase, treating ATD as a token classification task for pre-trained models. The effectiveness of PTCAD is demonstrated through evaluations on two benchmark datasets derived from the Tashkeela dataset, where it achieves state-of-the-art results, including a 20% reduction in Word Error Rate (WER) compared to existing benchmarks and superior performance over GPT-4 in ATD tasks.
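As a reading aid for the WER claim, the following sketch shows one common way to compute a diacritization WER (the fraction of words with at least one wrong diacritic) and a relative reduction between two systems. The function names are hypothetical, the exact evaluation protocol of the benchmarks may differ, and the numbers in the usage lines are made up purely to show the arithmetic; they are not results from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Fraction of words whose diacritized form differs from the reference.

    Assumes both strings contain the same whitespace-separated words,
    differing only in their diacritics.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("reference and hypothesis must have the same word count")
    if not ref_words:
        return 0.0
    errors = sum(r != h for r, h in zip(ref_words, hyp_words))
    return errors / len(ref_words)


def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction of a new system over a baseline."""
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative values only: a drop from 5.0% to 4.0% WER
# is a 20% relative reduction (1.0 / 5.0).
illustrative_gain = relative_reduction(5.0, 4.0)

# Toy strings: one of four words differs, so WER is 0.25.
toy_wer = word_error_rate("a b c d", "a b x d")
```

Note that a relative reduction of 20% is not the same as a drop of 20 percentage points; in the illustrative case above the absolute change is only one point.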
