Quality and Quantity of Machine Translation References for Automatic Metrics

Vilém Zouhar, Ondřej Bojar

TL;DR

The paper tackles how the quality and quantity of human references influence automatic MT evaluation metrics under budget constraints. It introduces a four-tier reference framework (R1–R4) plus post-edited and optimal references, evaluates multiple metrics via segment-level Kendall's $\tau$, and demonstrates that aggregating up to about seven references significantly boosts correlations, with mixing references from different quality levels generally beneficial. A budget-allocation algorithm balances reference quality and quantity under a fixed budget $B$, yielding configurations that outperform extreme strategies focused solely on either quality or quantity. Qualitative analyses reveal that extremely high-quality references can introduce translation shifts that hinder some metrics, underscoring the importance of practical reference design for reliable evaluation and shared-task settings.

Abstract

Automatic machine translation metrics typically rely on human translations to determine the quality of system translations. Common wisdom in the field dictates that the human references should be of very high quality. However, there are no cost-benefit analyses that could be used to guide practitioners who plan to collect references for machine translation evaluation. We find that higher-quality references lead to better metric correlations with humans at the segment level. Having up to 7 references per segment and taking their average (or maximum) helps all metrics. Interestingly, references from vendors of different qualities can be mixed together and still improve metric success. Higher-quality references, however, cost more to create, and we frame this as an optimization problem: given a specific budget, what references should be collected to maximize metric success? These findings can be used by evaluators of shared tasks when references need to be created under a certain budget.
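
The multi-reference setup described in the abstract can be illustrated with a minimal sketch. It assumes that a metric has already produced one score per (segment, reference) pair, collapses those scores by average or maximum, and then computes a plain Kendall's $\tau$ (the tau-a variant) against human scores. The function names, the toy numbers, and the choice of tau-a are illustrative assumptions, not the paper's implementation or data.

```python
from itertools import combinations
from statistics import mean


def aggregate(scores_per_ref, how="avg"):
    """Collapse one segment's per-reference metric scores into one score."""
    if how == "avg":
        return mean(scores_per_ref)
    if how == "max":
        return max(scores_per_ref)
    raise ValueError(f"unknown aggregation: {how}")


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Pairs tied in either variable count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy data (made up for illustration): 4 segments, each scored against 3 references.
metric_scores = [
    [0.2, 0.4, 0.3],
    [0.7, 0.5, 0.9],
    [0.1, 0.6, 0.5],
    [0.8, 0.9, 0.7],
]
human_scores = [30, 80, 40, 90]

avg_scores = [aggregate(s, "avg") for s in metric_scores]
max_scores = [aggregate(s, "max") for s in metric_scores]
tau_avg = kendall_tau_a(avg_scores, human_scores)
tau_max = kendall_tau_a(max_scores, human_scores)
```

With real data, `metric_scores` would hold the metric output for the same system translation scored against each available reference, and the two correlations would be compared across different numbers of references.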

Paper Structure (17 sections, 5 figures, 10 tables)

Figures (5)

  • Figure 1: Metric performance with multiple sampled references from the pool of the original human translations and their post-edited versions. Shaded areas show 99% t-test confidence intervals for the mean over 10 samples. The biggest gain comes from using at least three references and taking their segment-level average (max aggregation not shown).
  • Figure 2: Metric performance with references (ordered by usefulness) from mixed sources (e.g. 25% R1 and 75% R2; rightmost is 100% R3). Mixing references does not hurt any metric.
  • Figure 3: Illustration of two operations from the reference allocation algorithm. The initial state is on the left. Then, a new segment $x_{89}$ is added to the $l_3$ level. Lastly, the segment $x_{0}$ is promoted from $l_1$ to $l_2$.
  • Figure 4: Heatmaps of chrF (left) and COMET$_{20}$ (right) Kendall's $\tau$ correlations on reference configurations created with a specific budget (x-axis) and quality-quantity trade-off $\lambda$ (y-axis). $\bigstar$ marks the best value in each column (fixed budget). The first column corresponds to the cheapest translation for all test segments, with no room for selection. $\lambda \in [0, 0.7]$ and $t = 0.5$. With a limited budget, e.g. $2|S|$ or $3|S|$, it makes more sense to add some references of a higher quality rather than covering the whole test set with a second reference. With more budget available, multiple references per segment become more beneficial.
  • Figure 5: Average number of references per one segment allocated by the reference allocation algorithm with $\tau=0.5$ (top) and $\tau=10^{-3}$ (bottom).
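
The two operations illustrated in Figure 3 (adding a segment to a level and promoting a segment to a higher level) can be sketched as state updates under a fixed budget. This is only a rough illustration: the class name `Allocation`, the per-level costs in `COST`, and the rule that a promotion pays the cost difference between levels are all assumptions made for the example, and the sketch leaves out how the paper's algorithm chooses which operation to apply (the roles of $\lambda$ and $t$).

```python
from dataclasses import dataclass, field

# Hypothetical cost of one reference at each quality level (not from the paper).
COST = {1: 1.0, 2: 2.0, 3: 3.0}


@dataclass
class Allocation:
    budget: float
    # Maps each level to the set of segment ids that have a reference at that level.
    levels: dict = field(default_factory=lambda: {1: set(), 2: set(), 3: set()})

    def spent(self):
        """Total cost of all references currently allocated."""
        return sum(COST[lvl] * len(segs) for lvl, segs in self.levels.items())

    def add(self, segment, level):
        """Buy a new reference for `segment` at `level` if the budget allows."""
        if segment in self.levels[level]:
            return False
        if self.spent() + COST[level] > self.budget:
            return False
        self.levels[level].add(segment)
        return True

    def promote(self, segment, src, dst):
        """Move `segment` from `src` to a higher level `dst`.

        Assumption: a promotion costs the difference between the two levels.
        """
        if dst <= src or segment not in self.levels[src] or segment in self.levels[dst]:
            return False
        if self.spent() + (COST[dst] - COST[src]) > self.budget:
            return False
        self.levels[src].remove(segment)
        self.levels[dst].add(segment)
        return True


# Mirror the figure: start with x0 and x1 at level 1, add x89 at level 3, promote x0.
state = Allocation(budget=10.0)
state.levels[1].update({"x0", "x1"})
added = state.add("x89", 3)
promoted = state.promote("x0", 1, 2)
```

Both operations return `False` without changing the state when they would exceed the budget, so a search procedure can try candidate moves freely and keep only those that succeed.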