MQM-Chat: Multidimensional Quality Metrics for Chat Translation

Yunmeng Li, Jun Suzuki, Makoto Morishita, Kaori Abe, Kentaro Inui

TL;DR

MQM-Chat addresses the lack of chat-specific evaluation metrics by introducing a seven-type error taxonomy tailored to chat translation and built on the MQM framework. The study applies MQM-Chat to five state-of-the-art models across zh->en and ja->en, revealing significant chat-specific errors related to ambiguity, buzzwords, and dialogue consistency, and highlighting the importance of stylized content in chats. Through human and automated annotations, the approach demonstrates higher sensitivity to chat nuances than traditional metrics and provides a reusable benchmark with datasets across multiple domains. Automated MQM-Chat annotations show promising alignment with human judgments, motivating future refinement and broader language coverage.

Abstract

The complexities of chats pose significant challenges for machine translation models. Recognizing the need for a precise evaluation metric to address the issues of chat translation, this study introduces Multidimensional Quality Metrics for Chat Translation (MQM-Chat). In experiments evaluating five models with MQM-Chat, we observed that all models produced certain fundamental errors, while each had its own shortcomings, such as omission, over-correction of ambiguous source content, and buzzword issues, resulting in the loss of stylized information. Our findings underscore the effectiveness of MQM-Chat in evaluating chat translation and emphasize the importance of stylized content and dialogue consistency for future studies.

Paper Structure (35 sections, 5 figures, 7 tables)

This paper contains 35 sections, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Mapping of error types between MQM-Chat (green), MQM Core (orange), and the MQM Full types that are used (yellow). Blocks with deeper colors (Terminology, Locale Convention, and Audience Appropriateness) indicate that the corresponding sub-categories are included and merged into MQM-Chat. Blocks with gray text (Grammar, Spelling, Punctuation, Character Encoding) are errors that are marked only if they substantially disrupt the translation. Note that the relationship between the mapped blocks is not simply one of inclusion, because MQM-Chat error types cover broader issues in chat translation.
  • Figure 2: Heatmaps of the error numbers in MQM-Chat human annotations. Darker colors indicate higher numbers.
  • Figure 3: Heatmaps of error numbers in MQM-Chat auto annotations. Darker colors indicate higher numbers.
  • Figure 4: The prompt for MQM-Chat automatic annotations based on GEMBA-MQM.
  • Figure 5: Heatmaps of the error numbers in standard MQM human and auto annotations. Darker colors indicate higher numbers. Standard MQM human annotations are applied to 25% of the data. Error types not labeled in any of the four model translations are not shown.
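
The heatmaps in Figures 2, 3, and 5 report error numbers per model and error type. As a rough illustration of how such counts could be tallied from annotation records, here is a minimal sketch. The record format, the model names (`model_a`, `model_b`), and the error-type labels are hypothetical placeholders chosen for this example; they are not the paper's data or its annotation tooling.

```python
from collections import Counter

# Hypothetical annotation records: one (model, error_type) pair per marked error.
# Model names and error-type labels are placeholders, not the paper's data.
annotations = [
    ("model_a", "omission"),
    ("model_a", "buzzword"),
    ("model_b", "ambiguity"),
    ("model_a", "omission"),
    ("model_b", "dialogue_inconsistency"),
]


def count_errors(records):
    """Tally errors into a {model: Counter({error_type: count})} mapping."""
    counts = {}
    for model, error_type in records:
        counts.setdefault(model, Counter())[error_type] += 1
    return counts


def to_matrix(counts, models, error_types):
    """Lay the tallies out as rows (models) by columns (error types)."""
    # A Counter returns 0 for error types that were never observed.
    return [
        [counts.get(model, Counter())[error_type] for error_type in error_types]
        for model in models
    ]


counts = count_errors(annotations)
matrix = to_matrix(counts, ["model_a", "model_b"], ["omission", "ambiguity"])
```

The resulting `matrix` is the kind of model-by-error-type grid that a plotting library could render as a heatmap, with larger counts drawn in darker colors.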