
Is Context Helpful for Chat Translation Evaluation?

Sweta Agrawal, Amin Farajian, Patrick Fernandes, Ricardo Rei, André F. T. Martins

TL;DR

A meta-evaluation of existing sentence-level automatic metrics finds that reference-free metrics lag behind reference-based ones, especially when evaluating translation quality in out-of-English settings. Augmenting neural learned metrics with contextual information improves correlation with human judgments in the reference-free scenario and when evaluating translations in out-of-English settings.

Abstract

Despite the recent success of automatic metrics for assessing translation quality, their application in evaluating the quality of machine-translated chats has been limited. Unlike more structured texts such as news, chat conversations are often unstructured, short, and heavily reliant on contextual information. This raises questions about the reliability of existing sentence-level metrics in this domain, as well as the role of context in assessing translation quality. Motivated by this, we conduct a meta-evaluation of existing sentence-level automatic metrics, primarily designed for structured domains such as news, to assess the quality of machine-translated chats. We find that reference-free metrics lag behind reference-based ones, especially when evaluating translation quality in out-of-English settings. We then investigate how incorporating conversational contextual information in these metrics affects their performance. Our findings show that augmenting neural learned metrics with contextual information helps improve correlation with human judgments in the reference-free scenario and when evaluating translations in out-of-English settings. Finally, we propose a new evaluation metric, Context-MQM, that utilizes bilingual context with a large language model (LLM) and further validate that adding context helps even for LLM-based evaluation metrics.
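
As a rough illustration of what "correlation with human judgments" means in a meta-evaluation, the sketch below scores a metric by the Pearson correlation between its segment-level scores and human quality scores. This is a minimal sketch under stated assumptions: the abstract does not say which correlation statistic is used, the function name `pearson` is ours, and the numbers are made-up placeholders rather than results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder data: one human score and one metric score
# per translated chat segment (NOT taken from the paper).
human_scores = [0.9, 0.4, 0.7, 0.2]
metric_scores = [0.8, 0.5, 0.6, 0.3]

print(f"segment-level correlation: {pearson(human_scores, metric_scores):.3f}")
```

Under this setup, a metric is judged better when its scores correlate more strongly with the human scores; comparing the same metric with and without added conversational context would amount to running the same computation on two sets of metric scores.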

Paper Structure (40 sections, 1 equation, 8 figures, 5 tables)

Figures (8)

  • Figure 1: An example bilingual conversation from the MAIA corpus (Martins et al., 2020): the agent and the customer only see the texts in English and Portuguese, respectively. The errors (both MT and user-generated) are bold-faced.
  • Figure 2: Conversational texts tend to be much shorter relative to news texts in the WMT22 English-German dataset.
  • Figure 3: Counts of MQM error categories normalized by the number of annotated instances for each domain: frequent errors differ in the two domains.
  • Figure 4: Impact of varying context window and context type (across/within) on average correlation across Agent and Customer settings: adding complete context (across) helps improve metric performance in out-of-English reference-free settings (Agent) but is detrimental for into-English (Customer) evaluation.
  • Figure 5: Context helps the most in improving the translation quality estimation of shorter (source character length $\leq$ 20) and potentially ambiguous sentences (averaged over "all" Agent language pairs).
  • ...and 3 more figures