Quality and Quantity of Machine Translation References for Automatic Metrics
Vilém Zouhar, Ondřej Bojar
TL;DR
The paper tackles how the quality and quantity of human references influence automatic MT evaluation metrics under budget constraints. It introduces a four-tier reference framework (R1–R4) plus post-edited and optimal references, evaluates multiple metrics via segment-level Kendall's $\tau$, and demonstrates that aggregating up to about seven references significantly boosts correlations, with mixing references from different quality levels generally beneficial. A budget-allocation algorithm balances reference quality and quantity under a fixed budget $B$, yielding configurations that outperform extreme strategies focused solely on either quality or quantity. Qualitative analyses reveal that extremely high-quality references can introduce translation shifts that hinder some metrics, underscoring the importance of practical reference design for reliable evaluation and shared-task settings.
Abstract
Automatic machine translation metrics typically rely on human translations to determine the quality of system translations. Common wisdom in the field dictates that the human references should be of very high quality. However, there are no cost-benefit analyses that could be used to guide practitioners who plan to collect references for machine translation evaluation. We find that higher-quality references lead to better metric correlations with humans at the segment level. Having up to 7 references per segment and taking their average (or maximum) helps all metrics. Interestingly, references from vendors of different qualities can be mixed together and improve metric success. Higher-quality references, however, cost more to create, and we frame this as an optimization problem: given a specific budget, what references should be collected to maximize metric success? These findings can be used by evaluators of shared tasks when references need to be created under a certain budget.
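As a rough illustration of the multi-reference aggregation mentioned in the abstract, the sketch below collapses one metric score per reference into a single segment score (by average or maximum) and correlates the result with human judgments using a plain Kendall's tau-a. The function names `aggregate_scores` and `kendall_tau_a`, the toy numbers, and the choice of the tau-a variant are all assumptions made for this example; the paper's exact correlation variant and data are not reproduced here.

```python
from statistics import mean


def aggregate_scores(per_reference_scores, how="mean"):
    """Collapse one metric score per reference into a single segment score.

    Hypothetical helper: `how` is either "mean" or "max".
    """
    if not per_reference_scores:
        raise ValueError("need at least one reference score")
    if how == "mean":
        return mean(per_reference_scores)
    if how == "max":
        return max(per_reference_scores)
    raise ValueError(f"unknown aggregation: {how}")


def kendall_tau_a(xs, ys):
    """Plain Kendall's tau-a: (concordant - discordant) / number of pairs.

    Ties count as neither concordant nor discordant.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Toy data, invented for illustration only.
# Each inner list holds one metric score per reference for one segment.
metric_scores = [
    [0.62, 0.70, 0.66],
    [0.31, 0.45, 0.38],
    [0.80, 0.74, 0.83],
    [0.55, 0.20, 0.45],
]
human_scores = [70, 35, 90, 50]  # one human judgment per segment

for how in ("mean", "max"):
    segment_scores = [aggregate_scores(s, how) for s in metric_scores]
    print(how, kendall_tau_a(segment_scores, human_scores))
```

In a real setup the inner lists would come from running a metric against each available reference, and the number of references per segment would be one of the quantities traded off against reference quality under the budget.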
