Dialectal and Low-Resource Machine Translation for Aromanian
Alexandru-Iulius Jerpelea, Alina Rădoi, Sergiu Nisioi
TL;DR
This work tackles machine translation for Aromanian, an endangered language, by assembling a large, multi-genre parallel corpus (79k sentence pairs) and adapting state-of-the-art models. It introduces a LaBSE-based, language-agnostic sentence embedder that supports Aromanian and compares NLLB-based MT with several large language models, showing that NLLB variants consistently outperform LLM-based approaches for Aromanian translation. The study also explores orthography normalization (Cunia to DIARO) and synthetic data augmentation, and presents an online translation system (AroTranslate) with CPU-friendly quantization. Results indicate that diversified data and multilingual transfer from Romanian improve Aromanian translation, supporting language preservation while revealing remaining challenges in dialectal variation and data representativeness.
Abstract
This paper presents the process of building a neural machine translation system with support for English, Romanian, and Aromanian, an endangered Eastern Romance language. The primary contribution of this research is twofold: (1) the creation of the most extensive Aromanian-Romanian parallel corpus to date, consisting of 79,000 sentence pairs, and (2) the development and comparative analysis of several machine translation models optimized for Aromanian. To accomplish this, we introduce a suite of auxiliary tools, including a language-agnostic sentence embedding model for text mining and automated evaluation, complemented by a diacritics conversion system for different writing standards. This research contributes to both computational linguistics and language preservation efforts by establishing essential resources for a historically under-resourced language. All datasets, trained models, and associated tools are publicly available at https://huggingface.co/aronlp and https://arotranslate.com
