Fine-Tuned Machine Translation Metrics Struggle in Unseen Domains
Vilém Zouhar, Shuoyang Ding, Anna Currey, Tatyana Badeka, Jenyuan Wang, Brian Thompson
TL;DR
This work asks whether fine-tuned machine translation metrics generalize to unseen domains. It introduces a biomedical MQM dataset with 25k segment-level judgments covering 11 translation directions and 21 systems, benchmarks multiple metric families on it, and separates the effect of fine-tuning from that of the pre-trained backbone (e.g., XLM-R and NLLB). Fine-tuned metrics show a substantial domain gap in the biomedical domain, while Pre-trained+Algorithm metrics remain more robust; in-domain biomedical MQM data can substantially improve COMET performance. The authors publicly release the data and code and discuss implications for metric design, suggesting more diverse human judgments and better generalization strategies during fine-tuning to improve cross-domain reliability.
Abstract
We introduce a new, extensive multidimensional quality metrics (MQM) annotated dataset covering 11 language pairs in the biomedical domain. We use this dataset to investigate whether machine translation (MT) metrics that are fine-tuned on human-generated MT quality judgments are robust to domain shifts between training and inference. We find that fine-tuned metrics exhibit a substantial performance drop in the unseen-domain scenario relative to metrics that rely on the surface form, as well as to pre-trained metrics that are not fine-tuned on MT quality judgments.
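To make the notion of a "performance drop in the unseen domain" concrete, the following is a minimal sketch of one way such a gap could be quantified. It assumes that metric quality is measured as a segment-level rank correlation with human scores (Kendall's tau here, a common choice; the abstract does not state which statistic the authors use). The function names `kendall_tau` and `domain_gap` are hypothetical, and the numbers are invented toy values, not data from the released dataset.

```python
from itertools import combinations


def kendall_tau(metric_scores, human_scores):
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    Tied pairs contribute zero. Assumes higher is better for both inputs.
    """
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    total = 0
    pairs = 0
    for i, j in combinations(range(len(metric_scores)), 2):
        product = (metric_scores[i] - metric_scores[j]) * (
            human_scores[i] - human_scores[j]
        )
        # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie.
        total += (product > 0) - (product < 0)
        pairs += 1
    return total / pairs


def domain_gap(seen_domain, unseen_domain):
    """Correlation in the seen domain minus correlation in the unseen one.

    Each argument is a (metric_scores, human_scores) tuple.
    A positive value means the metric agrees less with humans out of domain.
    """
    return kendall_tau(*seen_domain) - kendall_tau(*unseen_domain)


# Toy values for illustration only (NOT taken from the paper).
# Human scores mimic MQM-style penalties negated so that higher is better.
seen = ([0.9, 0.7, 0.4, 0.2], [-1.0, -2.0, -5.0, -9.0])
unseen = ([0.8, 0.3, 0.6, 0.5], [-1.0, -2.0, -5.0, -9.0])

gap = domain_gap(seen, unseen)
print(f"domain gap: {gap:.3f}")
```

Under this assumed setup, comparing the gap of a fine-tuned metric against that of a surface-form or pre-trained metric on the same segments would show which family degrades more when the domain changes.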
