Predicting Machine Translation Performance on Low-Resource Languages: The Role of Domain Similarity
Eric Khiu, Hasti Toossi, David Anugraha, Jinyu Liu, Jiaxu Li, Juan Armando Parra Flores, Leandro Acros Roman, A. Seza Doğruöz, En-Shiun Annie Lee
TL;DR
The paper tackles MT performance prediction for low-resource languages (LRLs) without costly fine-tuning, focusing on three factors: fine-tuning corpus size, domain similarity, and language similarity. It applies classical regression with partitions to quantify each factor's predictive power. Domain similarity, quantified by Jensen–Shannon divergence (JSD), is the strongest predictor, while fine-tuning size and language distance have secondary or dataset-dependent effects. The framework includes normality and homoscedasticity checks and ranks features with three approaches (Pearson correlation, regression weights, and Random Forest). Experiments cover five South Asian LRLs with mBART. Domain similarity reliably predicts spBLEU, whereas language features are often weak predictors because the languages studied have limited linguistic diversity. These findings allow performance to be estimated for low-resource MT without running expensive fine-tuning, while highlighting data-domain biases and ethical considerations for equitable language representation in NLP systems.
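To make the domain-similarity measure concrete, the following is a minimal sketch of Jensen–Shannon divergence between two corpora. It assumes whitespace-tokenized text, unigram frequency distributions over the joint vocabulary, and base-2 logarithms (so the value lies in [0, 1]); the function names `unigram_dist`, `kl`, and `jsd` are illustrative, and the paper's exact tokenization and setup may differ.

```python
import math
from collections import Counter


def unigram_dist(tokens, vocab):
    """Relative frequency of each vocabulary item in a non-empty token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocab]


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(tokens_a, tokens_b):
    """Jensen-Shannon divergence between two token lists (assumed non-empty).

    0.0 means identical unigram distributions; 1.0 means disjoint vocabularies.
    """
    vocab = sorted(set(tokens_a) | set(tokens_b))
    p = unigram_dist(tokens_a, vocab)
    q = unigram_dist(tokens_b, vocab)
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical toy corpora, for illustration only.
train_tokens = "the court ruled on the appeal".split()
test_tokens = "the patient was given the dose".split()
score = jsd(train_tokens, test_tokens)
```

A lower value indicates that the fine-tuning and test corpora draw on more similar vocabulary, which is the sense of "domain similarity" used in the summary above.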
Abstract
Fine-tuning and testing a multilingual large language model is expensive and challenging for low-resource languages (LRLs). While previous studies have predicted the performance of natural language processing (NLP) tasks using machine learning methods, they primarily focus on high-resource languages, overlooking LRLs and shifts across domains. Focusing on LRLs, we investigate three factors: the size of the fine-tuning corpus, the domain similarity between fine-tuning and testing corpora, and the language similarity between source and target languages. We employ classical regression models to assess how these factors impact the model's performance. Our results indicate that domain similarity has the most critical impact on predicting the performance of Machine Translation models.
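As a rough illustration of what assessing a factor with a classical regression model can look like, here is a minimal single-predictor ordinary least squares fit in plain Python. This is a sketch under stated assumptions, not the authors' setup: the helper name `fit_line` is hypothetical, and the numbers are synthetic placeholders chosen to lie on an exact line, not results from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic placeholder data lying exactly on y = 3 - 2x (not from the paper).
factor_values = [0.0, 1.0, 2.0, 3.0]
scores = [3.0, 1.0, -1.0, -3.0]
slope, intercept = fit_line(factor_values, scores)
```

In a setting like the one described, the predictor would be one of the three factors (for example, a domain-similarity value) and the response a translation quality score; the sign and magnitude of the fitted slope then indicate how the score moves with that factor.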
