EthioMT: Parallel Corpus for Low-resource Ethiopian Languages
Atnafu Lambebo Tonja, Olga Kolesnikova, Alexander Gelbukh, Jugal Kalita
TL;DR
EthioMT addresses the scarcity of parallel data for Ethiopian languages by assembling a corpus that pairs English with 15 Ethiopian languages, along with a benchmark for four better-resourced Ethiopian languages. The dataset combines religious-domain and public-domain sources, with sentences aligned to English and split 70/10/20 into train/dev/test sets for MT experiments. The evaluation compares a transformer trained from scratch with a fine-tuned multilingual model (M2M100-48); fine-tuning generally outperforms the from-scratch baseline, especially for languages with larger corpora and more diverse domains. The data is released as open source to accelerate research on low-resource Ethiopian languages and to provide a baseline for future MT improvements in this multilingual setting.
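As a rough illustration of the 70/10/20 split mentioned above, the following is a minimal sketch and not the authors' pipeline. It assumes the aligned data is available as a list of (source sentence, English sentence) pairs and that the split is a seeded random shuffle; the function name `split_parallel` and the seed are hypothetical.

```python
import random


def split_parallel(pairs, seed=0, train_pct=70, dev_pct=10):
    """Shuffle aligned (source, English) pairs and split them into
    train/dev/test portions (70/10/20 by default).

    Assumption: a seeded random shuffle; the source does not say how
    the original split was drawn.
    """
    shuffled = list(pairs)                 # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)  # deterministic for a given seed
    n = len(shuffled)
    n_train = n * train_pct // 100         # integer arithmetic avoids float rounding
    n_dev = n * dev_pct // 100
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]      # the remainder goes to the test set
    return train, dev, test


# Hypothetical placeholder pairs standing in for real aligned sentences.
pairs = [(f"source sentence {i}", f"english sentence {i}") for i in range(100)]
train, dev, test = split_parallel(pairs)
```

Each pair is kept together during shuffling, so the sentence alignment to English is preserved in every split.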
Abstract
Recent research in natural language processing (NLP) has achieved impressive performance in tasks such as machine translation (MT), news classification, and question answering in high-resource languages. However, MT performance leaves much to be desired for low-resource languages, because the parallel corpora available for these languages are small, where such corpora exist at all. NLP for Ethiopian languages suffers from the same problem: publicly accessible datasets for NLP tasks, including MT, are unavailable. To help the research community and foster research on Ethiopian languages, we introduce EthioMT, a new parallel corpus for 15 languages. We also create a new benchmark by collecting a dataset for the better-researched languages of Ethiopia. We evaluate the newly collected corpus and the benchmark dataset for 23 Ethiopian languages using transformer and fine-tuning approaches.
