Mismatching-Aware Unsupervised Translation Quality Estimation For Low-Resource Languages

Fatemeh Azadi, Heshaam Faili, Mohammad Javad Dousti

TL;DR

This work presents XLMRScore, an unsupervised QE metric for low-resource languages that repurposes cross-lingual embeddings from XLM-R and addresses two key issues: untranslated tokens and mismatches during greedy word matching. It introduces two mitigation strategies (replacing untranslated tokens with UNK using a target-language vocabulary, and cross-lingual alignment via a contrastive loss on word alignments), plus a multilingual fine-tuning regime that further improves performance. Empirical results on four WMT21 low-resource language pairs and a newly released English→Persian (En-Fa) test set show substantial gains over the base XLMRScore, with the best configurations approaching supervised baselines in zero-shot settings and outperforming other unsupervised methods on average. The work also analyzes the explainability of QE outputs and outlines directions for stronger cross-lingual representations and integration with supervised QE components.

Abstract

Translation Quality Estimation (QE) is the task of predicting the quality of machine translation (MT) output without any reference. This task has gained increasing attention as an important component in the practical applications of MT. In this paper, we first propose XLMRScore, a cross-lingual counterpart of BERTScore computed via the XLM-RoBERTa (XLMR) model. This metric can be used as a simple unsupervised QE method, but it faces two issues: first, untranslated tokens lead to unexpectedly high translation scores, and second, mismatching errors arise between source and hypothesis tokens when applying greedy matching in XLMRScore. To mitigate these issues, we suggest, respectively, replacing untranslated words with the unknown token and cross-lingually aligning the pre-trained model so that aligned words are represented closer to each other. We evaluate the proposed method on four low-resource language pairs of the WMT21 QE shared task, as well as a new English→Persian (En-Fa) test dataset introduced in this paper. Experiments show that our method achieves results comparable to the supervised baseline in two zero-shot scenarios, i.e., with less than 0.01 difference in Pearson correlation, while outperforming unsupervised rivals in all the low-resource language pairs by more than 8% on average.
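
To make the two mechanisms in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes token embeddings are already available as plain lists of floats (in the paper they would come from XLM-R), and it assumes the simplest possible rule for untranslated tokens: any hypothesis token missing from a target-language vocabulary is replaced. The function names `greedy_match_f1` and `mask_untranslated`, the `"<unk>"` string, and the toy vectors are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_f1(src_vecs, hyp_vecs):
    """BERTScore-style greedy matching between source and hypothesis tokens.

    Each source token is matched to its most similar hypothesis token
    (recall side) and each hypothesis token to its most similar source
    token (precision side); the two averages are combined into an F1.
    Assumes precision + recall is positive.
    """
    sim = [[cosine(s, h) for h in hyp_vecs] for s in src_vecs]
    recall = sum(max(row) for row in sim) / len(src_vecs)
    precision = sum(max(col) for col in zip(*sim)) / len(hyp_vecs)
    return 2 * precision * recall / (precision + recall)


def mask_untranslated(hyp_tokens, target_vocab, unk="<unk>"):
    """Replace hypothesis tokens absent from the target-language vocabulary.

    Assumption: vocabulary membership alone decides whether a token counts
    as untranslated; the paper's exact criterion may differ.
    """
    return [tok if tok in target_vocab else unk for tok in hyp_tokens]


# Toy embeddings: each source vector has one perfectly aligned hypothesis vector.
src = [[1.0, 0.0], [0.0, 1.0]]
hyp = [[0.0, 2.0], [3.0, 0.0]]
score = greedy_match_f1(src, hyp)

# "up" was copied from the English source and is not in the Persian vocabulary.
masked = mask_untranslated(["up", "بالا"], {"بالا"})
```

Because the matching is greedy and purely similarity-based, a copied source token would match itself almost perfectly, which is why masking it before scoring matters. It is also why a poorly aligned embedding space can pair a token with the wrong counterpart.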

Paper Structure (20 sections, 4 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: An example of the mismatching issue. With the base pre-trained model (a), the Persian token "بالا" is matched to the English token "The" instead of the correct token "up". After fine-tuning with the cross-lingual alignment strategy (b), the model correctly matches "بالا" to "up".
  • Figure 2: Distribution of the HTER scores in the En-Fa test set.
  • Figure 3: Pearson correlation of our base model as a function of the number of sentences.
  • Figure 4: Alignment error rate (AER) results across different layers for the En-Fa (a) and En-Hi (b) test sets.