Data Augmentation for Maltese NLP using Transliterated and Machine Translated Arabic Data
Kurt Micallef, Nizar Habash, Claudia Borg
TL;DR
The paper addresses the challenge of leveraging Arabic NLP resources to support Maltese by proposing transliteration- and translation-based data augmentation. It introduces two transliteration systems, CharTx and MorphTx, designed to align Arabic inputs with Maltese orthography and morphology, and evaluates their impact across Maltese NLP tasks using monolingual and multilingual BERT-based models. Results show that Arabic augmentation can improve performance, with the size of the gains depending on model type and augmentation strategy, and that cascaded fine-tuning across Arabic variants yields further improvements for multilingual models. The work demonstrates the viability of cross-lingual augmentation for a low-resource language and suggests directions for unsupervised alignment and broader application to Maltese NLP tasks.
Abstract
Maltese is a unique Semitic language that has evolved under extensive influence from Romance and Germanic languages, particularly Italian and English. Despite its Semitic roots, its orthography is based on the Latin script, creating a gap between it and its closest linguistic relatives in Arabic. In this paper, we explore whether Arabic-language resources can support Maltese natural language processing (NLP) through cross-lingual augmentation techniques. We investigate multiple strategies for aligning Arabic textual data with Maltese, including various transliteration schemes and machine translation (MT) approaches. As part of this, we also introduce novel transliteration systems that better represent Maltese orthography. We evaluate the impact of these augmentations on monolingual and multilingual models and demonstrate that Arabic-based augmentation can significantly benefit Maltese NLP tasks.
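To make the idea of a transliteration scheme concrete, the following is a minimal sketch of a naive character-level mapping from Arabic script to Maltese-style Latin letters. It is not the paper's CharTx or MorphTx system: the table `TOY_MAP` and the function `toy_transliterate` are hypothetical names, and the handful of letter correspondences are illustrative assumptions chosen only to show the mechanism.

```python
# Illustrative sketch only. TOY_MAP is a hypothetical, incomplete table and
# is NOT the mapping used by the paper's CharTx or MorphTx systems.
TOY_MAP = {
    "ك": "k",
    "ت": "t",
    "ب": "b",
    "ل": "l",
    "م": "m",
    "ن": "n",
    "ر": "r",
    "س": "s",
    "د": "d",
    "ا": "a",
    "ق": "q",
    "ح": "ħ",
    "ع": "għ",
    "ش": "x",
    "ج": "ġ",
}


def toy_transliterate(text: str) -> str:
    """Map each character through TOY_MAP, leaving unmapped characters unchanged."""
    return "".join(TOY_MAP.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # Arabic "kitab" (book) is written without its short vowel.
    print(toy_transliterate("كتاب"))
```

Under this toy table the Arabic word for "book" comes out as `ktab`, whereas the Maltese word is "ktieb". A context-free character mapping of this kind cannot recover unwritten short vowels or Maltese-specific vowel changes, which gives some intuition for why transliteration systems tailored to Maltese orthography may be needed.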
