Predicting Machine Translation Performance on Low-Resource Languages: The Role of Domain Similarity

Eric Khiu, Hasti Toossi, David Anugraha, Jinyu Liu, Jiaxu Li, Juan Armando Parra Flores, Leandro Acros Roman, A. Seza Doğruöz, En-Shiun Annie Lee

TL;DR

The paper addresses machine translation (MT) performance prediction for low-resource languages (LRLs) without costly fine-tuning and testing, focusing on three factors: fine-tuning corpus size, domain similarity, and language similarity. It applies classical regression models over partitioned data to quantify each factor's predictive power. Domain similarity, measured by Jensen–Shannon divergence (JSD), is the strongest predictor, while fine-tuning corpus size and language distance have secondary or dataset-dependent effects. The framework includes normality and homoscedasticity checks and ranks features with three approaches (Pearson correlation, regression weights, and Random Forest) across five South Asian LRLs with mBART. Domain similarity reliably predicts spBLEU, whereas language features are often weak predictors because the languages studied have limited linguistic diversity. These findings allow performance to be estimated for low-resource MT without expensive fine-tuning, and the paper also highlights data-domain biases and ethical considerations for equitable language representation in NLP systems.

Abstract

Fine-tuning and testing a multilingual large language model is expensive and challenging for low-resource languages (LRLs). While previous studies have predicted the performance of natural language processing (NLP) tasks using machine learning methods, they primarily focus on high-resource languages, overlooking LRLs and shifts across domains. Focusing on LRLs, we investigate three factors: the size of the fine-tuning corpus, the domain similarity between fine-tuning and testing corpora, and the language similarity between source and target languages. We employ classical regression models to assess how these factors impact the model's performance. Our results indicate that domain similarity has the most critical impact on predicting the performance of Machine Translation models.
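
To make the domain-similarity measure concrete, the following is a minimal sketch of a Jensen–Shannon divergence computed between the unigram distributions of two tokenized corpora. It is an illustration only: the function names, the whitespace tokenization, and the toy sentences are assumptions made for this example, and the paper's own tokenization and vocabulary choices are not stated in this summary.

```python
import math
from collections import Counter


def unigram_distribution(tokens):
    """Map each token to its relative frequency in the token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def jensen_shannon_divergence(p, q):
    """Base-2 JSD between two distributions given as token -> probability dicts.

    With log base 2 the value lies in [0, 1]: 0 for identical
    distributions, 1 for distributions with no shared tokens.
    """
    jsd = 0.0
    for tok in set(p) | set(q):
        pi = p.get(tok, 0.0)
        qi = q.get(tok, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution M = (P + Q) / 2
        if pi > 0:
            jsd += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            jsd += 0.5 * qi * math.log2(qi / mi)
    return jsd


# Hypothetical toy corpora, whitespace-tokenized for illustration only.
fine_tune_tokens = "the minister announced a new policy today".split()
test_tokens = "the doctor announced a new treatment today".split()

domain_jsd = jensen_shannon_divergence(
    unigram_distribution(fine_tune_tokens),
    unigram_distribution(test_tokens),
)
print(f"JSD between toy corpora: {domain_jsd:.3f}")
```

A lower value indicates that the fine-tuning and testing corpora use more similar vocabulary. Under the setup described above, such a value would serve as one input feature to the regression model.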

Paper Structure (36 sections, 1 equation, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Regression plots using best predictor functions for size and domain on best partitioning schemes.
  • Figure 2: Boxplots of residuals using best predictor functions for size and domain on some partitioning schemes.
  • Figure 3: Scatter Plots of spBLEU with respect to size using different partitioning schemes.
  • Figure 4: Scatter Plot of spBLEU with respect to JSD, partitioned by target language.