YAD: Leveraging T5 for Improved Automatic Diacritization of Yorùbá Text

Akindele Michael Olawole, Jesujoba O. Alabi, Aderonke Busayo Sakpere, David I. Adelani

TL;DR

The paper addresses the lack of a standard benchmark for Yorùbá diacritization by introducing the YAD dataset and a Yorùbá-focused T5 pretraining pipeline. It demonstrates that larger training data and bigger models improve diacritization performance, with AfriTeVa-V2-large and Oyo-T5 variants delivering strong results across BLEU and CHRF metrics. The work also analyzes the effects of training data sources and domain specificity, and releases code and data to support reproducibility and further research. Overall, YAD provides a practical benchmark and shows that data and model scaling significantly advance Yorùbá diacritization, with implications for deployable, lightweight diacritizers.

Abstract

In this work, we present the Yorùbá automatic diacritization (YAD) benchmark dataset for evaluating Yorùbá diacritization systems. In addition, we pre-train a text-to-text transformer (T5) model for Yorùbá and show that this model outperforms several multilingually trained T5 models. Lastly, we show that more data and larger models improve diacritization for Yorùbá.
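
To make the task concrete, here is a minimal sketch of how diacritization can be framed as a text-to-text problem: the model input is the text with tone marks and underdots removed, and the target is the fully diacritized text. This section does not describe the paper's own preprocessing, so the approach (Unicode decomposition followed by dropping combining marks) and the helper names `strip_diacritics` and `make_pair` are assumptions made for illustration only.

```python
import unicodedata
from typing import Tuple


def strip_diacritics(text: str) -> str:
    """Remove combining marks (tone marks, underdots) from text.

    Assumption: undiacritized input is approximated by decomposing to NFD
    and dropping every character in Unicode category 'Mn'.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


def make_pair(diacritized: str) -> Tuple[str, str]:
    """Build one hypothetical (source, target) example for a seq2seq model."""
    target = unicodedata.normalize("NFC", diacritized)
    return strip_diacritics(target), target


if __name__ == "__main__":
    source, target = make_pair("Yorùbá")
    print(source, "->", target)
```

A pair built this way could be fed to any sequence-to-sequence model, with the stripped string as input and the original string as the reference for metrics such as BLEU and chrF. Normalizing both sides to NFC keeps character-level metrics from being skewed by mixed composed and decomposed forms.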

Paper Structure (10 sections, 5 tables)