Pitfalls and Outlooks in Using COMET

Vilém Zouhar, Pinzhen Chen, Tsz Kin Lam, Nikita Moghe, Barry Haddow

TL;DR

This work scrutinizes COMET, a neural machine translation (MT) quality metric, and shows how technical setup, data properties, and reporting practices can undermine fair comparison across studies. Through systematic experiments and targeted analyses, it uncovers issues ranging from obsolete software versions and compute precision to translationese effects and domain biases, and demonstrates how these can distort scores or rankings. To address reproducibility, the authors release sacreCOMET, a package that generates a signature for the software and model configuration along with an appropriate citation, and they propose concrete recommendations for robust reporting and evaluation. The findings underscore the need for careful calibration, multi-reference consideration, and cross-metric validation when using learned metrics. Collectively, the work guides practitioners toward more reliable, transparent, and interpretable use of COMET and similar learned evaluation tools, with broader implications for learned metrics in NLP.

Abstract

The COMET metric has blazed a trail in the machine translation community, given its strong correlation with human judgements of translation quality. Its success stems from being a modified pre-trained multilingual model finetuned for quality assessment. However, it being a machine learning model also gives rise to a new set of pitfalls that may not be widely known. We investigate these unexpected behaviours from three aspects: 1) technical: obsolete software versions and compute precision; 2) data: empty content, language mismatch, and translationese at test time as well as distribution and domain biases in training; 3) usage and reporting: multi-reference support and model referencing in the literature. All of these problems imply that COMET scores are not comparable between papers or even technical setups and we put forward our perspective on fixing each issue. Furthermore, we release the sacreCOMET package that can generate a signature for the software and model configuration as well as an appropriate citation. The goal of this work is to help the community make more sound use of the COMET metric.

Paper Structure (38 sections, 4 figures, 14 tables)

Figures (4)

  • Figure 1: Distribution of instance-level scores for empty and baseline translations (x-axis: score; y-axis: count). See the appendix for other translation directions.
  • Figure 2: Setup of an experiment with the bottom 75% of En$\rightarrow$Zh scores, which creates a bias in COMET$_\mathrm{22}^\mathrm{DA}$. In the new data for En$\rightarrow$Zh (bottom right), there are no translations with perfect scores. The En$\rightarrow$De data are unaffected.
  • Figure 3: Prompt template used to request a paraphrase from GPT-4o, where $HYPOTHESIS is replaced by individual hypotheses.
  • Figure 4: Distribution of instance-level scores for empty and baseline translations (x-axis: score; y-axis: count).