Towards Explainable Evaluation Metrics for Machine Translation

Christoph Leiter, Piyawat Lertvittayakumjorn, Marina Fomicheva, Wei Zhao, Yang Gao, Steffen Eger

TL;DR

This concept paper identifies key properties as well as key goals of explainable machine translation metrics and provides a comprehensive synthesis of recent techniques, relating them to the established goals and properties.

Abstract

Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics for machine translation (for example, COMET or BERTScore) are based on black-box large language models. They often achieve strong correlations with human judgments, but recent research indicates that the lower-quality classical metrics remain dominant, one of the potential reasons being that their decision processes are more transparent. To foster more widespread acceptance of novel high-quality metrics, explainability thus becomes crucial. In this concept paper, we identify key properties as well as key goals of explainable machine translation metrics and provide a comprehensive synthesis of recent techniques, relating them to our established goals and properties. In this context, we also discuss the latest state-of-the-art approaches to explainable metrics based on generative models such as ChatGPT and GPT-4. Finally, we contribute a vision of next-generation approaches, including natural language explanations. We hope that our work can help catalyze and guide future research on explainable evaluation metrics and, indirectly, also contribute to better and more transparent machine translation systems.

Paper Structure (10 sections, 1 equation, 5 figures, 1 table)

Figures (5)

  • Figure 1: An overview of use-cases for MT metrics. In number 4, many candidate translations are produced for a single source. One of them is chosen as the main translation, via the metric.
  • Figure 2: Goals of explainable MT evaluation. Each goal has audiences, which might be interested in its fulfillment; these are indicated with "A:". Further, each goal follows intentions, which are denoted with $\shortrightarrow$.
  • Figure 3: Overview of papers addressing the explainability of machine translation evaluation. The graphic structure is adapted from a survey on post-hoc methods for explainable NLP by Madsen et al. (2022). The rows show the types of explanations returned by the respective methods. They are ordered from decision-understanding to model-understanding. Decision-understanding techniques explain specific outputs of a metric, while model-understanding techniques describe general properties. The columns show the level of model access that each explainability technique requires. Fields where boxes overlap are colored in a darker grey. Note that most work listed under feature importance was developed as part of the Eval4NLP21 (Fomicheva et al., 2021) and the WMT22 QE (Zerva et al., 2022) shared tasks. $^N$ indicates that a technique is not directly from the MT domain. $^R$ indicates that a technique was proposed to tackle robustness. $^F$ indicates that the technique was proposed to explore fairness aspects. $^E$ is set when the papers' authors themselves perceived that their method tackles explainability.
  • Figure 4: Explanation types shown in Figure 3. For each type we provide a hypothetical example. For this we assume the exemplary translation that is shown in the top-left box. Here, the polysemic word "became" is translated with the wrong word sense "to receive" instead of "to get/to develop into". The correct translation is "wurden". The example for perturbation robustness perturbs the hypothesis sentence.
  • Figure 5: Exemplary future explanation types for MT metrics. For each type we provide a hypothetical example. For this we assume the exemplary translation that is shown in the top-left box. Here, the polysemic word "became" is translated with the wrong word sense "to receive" instead of "to get/to develop into". The correct translation is "wurden".