Fine-Tuned Machine Translation Metrics Struggle in Unseen Domains

Vilém Zouhar, Shuoyang Ding, Anna Currey, Tatyana Badeka, Jenyuan Wang, Brian Thompson

TL;DR

This work asks whether fine-tuned machine translation metrics generalize to unseen domains. It introduces a biomedical MQM dataset with 25k segment-level judgments covering 11 translation directions and 21 systems, benchmarks multiple metric families, analyzes domain robustness, and separates the effect of fine-tuning from that of the pre-trained backbone (e.g., XLM-R and NLLB). Fine-tuned metrics show a substantial performance gap in the biomedical domain, whereas Pre-trained+Algorithm metrics remain more robust; in-domain bio MQM data can substantially improve Comet performance. The authors publicly release the data and code and discuss implications for metric design, suggesting more diverse human judgments and better generalization strategies during fine-tuning to improve cross-domain reliability.

Abstract

We introduce a new, extensive multidimensional quality metrics (MQM) annotated dataset covering 11 language pairs in the biomedical domain. We use this dataset to investigate whether machine translation (MT) metrics which are fine-tuned on human-generated MT quality judgements are robust to domain shifts between training and inference. We find that fine-tuned metrics exhibit a substantial performance drop in the unseen domain scenario relative to metrics that rely on the surface form, as well as pre-trained metrics which are not fine-tuned on MT quality judgments.

Paper Structure (36 sections, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Automatic machine translation metric performance on the WMT and biomedical domains, averaged across metric types (see \ref{fig:domain_mismatch} for full results).
  • Figure 2: Gains in segment-level correlation (Kendall's $\tau$) when comparing Surface-Form metrics (average performance of BLEU, ChrF, and TER) to a given metric, on the WMT and bio test sets. Gains for Pre-trained+Fine-tuned metrics are much smaller in the unseen bio domain than the WMT domain. Pre-trained+Algorithm metrics, which do not train on prior WMT data, do not exhibit the same bias. See \ref{sec:raw_kendal_tao} for results in tabular form.
  • Figure 3: Average performance (8 seeds) of Comet fine-tuned on varying amounts of MQM bio data.
  • Figure 4: Metric performance when the pre-trained model is fine-tuned (FT) on bio or WMT domain data. Lower perplexity improves BERTScore ($\bigcirc$) but worsens Comet ($\square$). Perplexity is the average of the MLM and TLM objectives on the text portion of the MQM dataset for both domains.
  • Figure 5: Multiple NLLB MT models are used as the base model for PrismSrc. Fine-tuning the underlying MT model improves the metric. Compute constraints preclude fine-tuning NLLB-3.3B.
  • ...and 2 more figures
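
The quantity plotted in Figure 2 (the gain in segment-level Kendall's $\tau$ of a metric over the average of the Surface-Form metrics) can be illustrated with a small sketch. This is not the authors' evaluation code: it uses a plain Kendall's tau-b, the function names are hypothetical, the toy scores are invented for illustration, and it assumes every score list is oriented so that higher means better (an error rate such as TER would need its sign flipped first).

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length score lists (O(n^2) pairs)."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    concordant = discordant = only_x_tied = only_y_tied = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else 0.0


def gain_over_surface_form(human, metric_scores, surface_form_scores):
    """Tau of one metric minus the mean tau of the surface-form metrics.

    `surface_form_scores` maps a metric name to its per-segment scores.
    """
    baseline = mean(kendall_tau_b(human, s) for s in surface_form_scores.values())
    return kendall_tau_b(human, metric_scores) - baseline


# Hypothetical toy data: four segments, human MQM-style scores (higher = better).
human = [0.0, -1.0, -5.0, -10.0]
learned_metric = [0.9, 0.8, 0.4, 0.1]
surface = {"surface_metric_a": [0.5, 0.6, 0.2, 0.1]}

gain = gain_over_surface_form(human, learned_metric, surface)
print(f"gain in tau over surface-form baseline: {gain:.3f}")
```

Computing this gain separately on the WMT and bio test sets, and comparing the two values per metric, gives the kind of contrast the figure reports.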