MQM-Chat: Multidimensional Quality Metrics for Chat Translation
Yunmeng Li, Jun Suzuki, Makoto Morishita, Kaori Abe, Kentaro Inui
TL;DR
MQM-Chat addresses the lack of chat-specific evaluation metrics by introducing a seven-type error taxonomy tailored to chat translation and built on the MQM framework. The study applies MQM-Chat to five state-of-the-art models on Chinese-to-English (zh→en) and Japanese-to-English (ja→en) translation, revealing significant chat-specific errors related to ambiguity, buzzwords, and dialogue consistency, and highlighting the importance of stylized content in chats. Using both human and automated annotations, the approach shows higher sensitivity to chat nuances than traditional metrics and provides a reusable benchmark with datasets across multiple domains. Automated MQM-Chat annotations show promising alignment with human judgments, motivating future refinement and broader language coverage.
Abstract
The complexities of chats pose significant challenges for machine translation models. Recognizing the need for a precise evaluation metric to address the issues of chat translation, this study introduces Multidimensional Quality Metrics for Chat Translation (MQM-Chat). In experiments with five models evaluated using MQM-Chat, we observed that all models produced certain fundamental errors, while each had its own shortcomings, such as omission, overcorrection of ambiguous source content, and buzzword issues, which resulted in the loss of stylized information. Our findings underscore the effectiveness of MQM-Chat in evaluating chat translation and emphasize the importance of stylized content and dialogue consistency for future studies.
