Large Language Models as Annotators for Machine Translation Quality Estimation

Sidi Wang; Sophie Arnoult; Amir Kamran

TL;DR

The authors argue that segment-level annotations provide a strong rationale for LLMs and are key to good segment-level QE. They propose a simplified MQM scheme, mostly restricted to top-level categories, to guide LLM selection.

Abstract

Large Language Models (LLMs) have demonstrated excellent performance on Machine Translation Quality Estimation (MTQE), yet their high inference costs make them impractical for direct application. In this work, we propose applying LLMs to generate MQM-style annotations for training a COMET model: following Fernandes et al. (2023), we argue that segment-level annotations provide a strong rationale for LLMs and are key to good segment-level QE. We propose a simplified MQM scheme, mostly restricted to top-level categories, to guide LLM selection. We present a systematic approach to developing a GPT-4o-based prompt, called PPbMQM (Prompt-Pattern-based-MQM). We show that the resulting annotations correlate well with human annotations and that training COMET on them leads to competitive performance on segment-level QE for Chinese-English and English-German.

Paper Structure

This paper contains 13 sections, 3 figures, and 8 tables.

Introduction
Prompt development for MQM
Experiment design
Zero-shot prompting
Few-shot prompting
Downstream QE model training
Discussion
Conclusion
Appendix
Basic prompt design
Few-shot prompt
Stability analysis
MQM inter-annotator agreement

Figures (3)

Figure 1: Example MQM annotation. Error spans are marked with error type and severity.
Figure 2: Initial zero-shot prompt. This prompt was further developed by changing the error span index instruction to 'split the target sentence using NLTK tokenizer and get the marked text start and end index'.
Figure 3: The final few-shot prompt, which improves upon the previous steps. Changes with regard to the zero-shot prompt are marked in red: severity scale; a detailed explanation of error categories; an example of Fluency; and an example and value instruction for Omission (since the 'marked text' should come from the source sentence). For figure readability, we removed some quotation marks in the pseudo-JSON strings.
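
The span index instruction quoted in the Figure 2 caption (tokenize the target sentence, then report the start and end index of the marked text) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the regex tokenizer is a standard-library stand-in for the NLTK tokenizer named in the prompt, and the choice of an inclusive end index and of returning the first match are assumptions.

```python
import re
from typing import Callable, List, Optional, Tuple


def simple_tokenize(text: str) -> List[str]:
    # Stand-in tokenizer (assumption): words and single punctuation marks.
    # The prompt itself refers to the NLTK tokenizer, which may split differently.
    return re.findall(r"\w+|[^\w\s]", text)


def find_span_indices(
    target: str,
    marked_text: str,
    tokenize: Callable[[str], List[str]] = simple_tokenize,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) token indices of the first occurrence of
    marked_text in target, with an inclusive end index (assumption),
    or None if the marked text is empty or not found."""
    target_tokens = tokenize(target)
    span_tokens = tokenize(marked_text)
    if not span_tokens:
        return None
    n = len(span_tokens)
    # Slide a window of n tokens over the target and compare.
    for start in range(len(target_tokens) - n + 1):
        if target_tokens[start:start + n] == span_tokens:
            return start, start + n - 1
    return None


if __name__ == "__main__":
    print(find_span_indices("The cat sat on the mat.", "on the mat"))
```

A different tokenizer can be supplied through the `tokenize` argument, so that the indices match whatever tokenization the annotation consumer expects.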
