Quality Estimation with $k$-nearest Neighbors and Automatic Evaluation for Model-specific Quality Estimation

Tu Anh Dinh, Tobias Palzer, Jan Niehues

TL;DR

This paper introduces kNN-QE, a model-specific, unsupervised quality estimation method that leverages k-nearest neighbors from MT training data to score MT outputs without labeled QE data. It also proposes an automatic evaluation approach for QE that uses reference-based metrics as gold standards, with MetricX-23 XL identified as the most robust for ranking QE metrics. Empirical results show kNN-QE outperforms a plain MT-probability baseline but remains behind supervised QE, and it benefits from small datastore sizes and limited neighbors. The automatic evaluation framework demonstrates strong correlation with human judgments across tasks and domains, supporting efficient internal QE development and cross-model comparability.

Abstract

Providing quality scores along with Machine Translation (MT) output, so-called reference-free Quality Estimation (QE), is crucial to inform users about the reliability of the translation. We propose a model-specific, unsupervised QE approach, termed $k$NN-QE, that extracts information from the MT model's training data using $k$-nearest neighbors. Measuring the performance of model-specific QE methods is not straightforward: they provide quality scores on their own MT output, and thus cannot be evaluated using benchmark QE test sets, which contain human quality scores on premade MT output. Therefore, we propose an automatic evaluation method that uses quality scores from reference-based metrics as the gold standard instead of human-generated ones. We are the first to conduct detailed analyses of this setup, and we conclude that the automatic method is sufficient and that the reference-based metric MetricX-23 is best suited for the task.
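The automatic evaluation idea in the abstract can be illustrated with a minimal sketch. It assumes that a QE method is judged by how well its segment-level scores correlate with the scores of a reference-based metric on the same MT outputs, and that Pearson correlation is the agreement measure; the abstract does not specify either detail. The function names, method names, and all numbers below are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def rank_qe_methods(qe_scores_by_method, gold_scores):
    """Rank QE methods by correlation with the gold scores, best first.

    gold_scores stands in for reference-based metric scores computed on the
    MT model's own outputs (assumption: higher means better; negate the
    scores first if the metric reports errors instead).
    """
    results = {
        name: pearson(scores, gold_scores)
        for name, scores in qe_scores_by_method.items()
    }
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


# Toy data: five translated segments, all values made up.
gold = [0.9, 0.4, 0.7, 0.2, 0.6]
qe_scores = {
    "qe_a": [0.8, 0.5, 0.6, 0.1, 0.7],  # hypothetical QE method A
    "qe_b": [0.3, 0.6, 0.2, 0.7, 0.5],  # hypothetical QE method B
}

ranking = rank_qe_methods(qe_scores, gold)
for name, corr in ranking:
    print(f"{name}: {corr:.3f}")
```

In this toy setup the method whose scores track the gold scores more closely is ranked first. Swapping in a rank correlation such as Kendall's tau would only require replacing the `pearson` helper.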

Paper Structure

This paper contains 45 sections, 8 equations, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: Illustration of our automatic QE evaluation approach.
  • Figure 2: QE ranking performance across different factors.
  • Figure 3: Correlation between the performance on evaluating translation segments and the performance on QE ranking.
  • Figure 4: Correlation between the performance on evaluating translation segments and the performance on QE ranking, limited by MT system and domain.
  • Figure 5: QE-M Task: Effect of number of references. The first 2 boxes use human references only, while the last 2 boxes also include synthetic references created by paraphrasing.
  • ...and 2 more figures