MTUncertainty: Assessing the Need for Post-editing of Machine Translation Outputs by Fine-tuning OpenAI LLMs

Serge Gladkoff, Lifeng Han, Gleb Erofeev, Irina Sorokina, Goran Nenadic

TL;DR

The paper addresses the practical problem of predicting, without reference translations, whether MT outputs require post-editing. It proposes LLMB2PEN, a binary classification approach obtained by fine-tuning OpenAI LLMs (curie, davinci, gpt3.5-turbo) on bilingual triplets (source, MT, post-edited reference) with no prompt engineering, and evaluates it across eight language pairs. A fine-tuned gpt3.5 achieves about 82–84% accuracy, and larger models do not substantially outperform smaller ones. Careful handling of LAI segments can yield meaningful post-editing savings. Extended experiments across more language pairs and news-domain data corroborate these results and reveal language-specific patterns. The work demonstrates a viable path to reducing post-editing workload and costs in MT pipelines, with practical deployment strategies and avenues for ongoing, multilingual learning.

Abstract

Translation Quality Evaluation (TQE) is an essential step in the modern translation production process. It is critical for assessing the quality of both machine translation (MT) and human translation (HT) without reference translations. The ability to evaluate, or even just estimate, translation quality automatically may yield significant efficiency gains through process optimisation. This work examines whether state-of-the-art large language models (LLMs) can be used for this purpose. We take OpenAI models as the leading state-of-the-art technology and approach TQE as a binary classification task. On eight language pairs, English to Italian, German, French, Japanese, Dutch, Portuguese, Turkish, and Chinese, our experimental results show that fine-tuned gpt3.5 performs well at predicting translation quality, i.e. whether a translation needs to be edited. A second finding, from comparing three OpenAI models (curie, davinci, and gpt3.5, with 13B, 175B, and 175B parameters, respectively), is that simply increasing LLM size does not lead to clearly better performance on this task.
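
The binary framing described above can be made concrete with a small sketch. It assumes that each training item is a (source, MT, post-edited) triplet and that a segment counts as "needs editing" whenever the post-editor changed the MT output. The exact labelling rule and data format used by the authors are not given here, so the `Triplet` class, the whitespace normalisation, and the helper names are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Triplet:
    source: str       # source-language segment
    mt: str           # raw machine translation output
    post_edited: str  # human post-edited version of the MT output


def needs_editing(t: Triplet) -> int:
    """Return 1 if the post-editor changed the MT output, else 0.

    Assumption: whitespace-only differences are not counted as edits.
    """
    return int(" ".join(t.mt.split()) != " ".join(t.post_edited.split()))


def to_example(t: Triplet) -> dict:
    """Build one training record; the classifier input is source and MT only."""
    return {"source": t.source, "mt": t.mt, "label": needs_editing(t)}


def accuracy(tn: int, fp: int, fn: int, tp: int) -> float:
    """Share of segments whose edit / no-edit label was predicted correctly."""
    total = tn + fp + fn + tp
    return (tn + tp) / total if total else 0.0
```

With this convention, `to_example(Triplet("Good morning", "Buona mattina", "Buongiorno"))` yields a record labelled `1`, and the four confusion-matrix cells (TN, FP, FN, TP) reduce to a single accuracy figure through `accuracy`. The post-edited text serves only to derive the label, which is why the prediction itself needs no reference translation.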

Paper Structure (10 sections, 15 figures)

Figures (15)

  • Figure 1: LLMB2PEN Methodology Design on Fine-tuning LLMs for Binary Prediction of Post-editing Need on Translations.
  • Figure 2: EN-IT Examples on MT and Post-Editing
  • Figure 3: EN-DE Examples on MT and Post-Editing
  • Figure 4: EN-IT Confusion Matrix of LLMB2PEN, curie model: Clockwise from top-left corner (TN, FP, TP, FN)
  • Figure 5: EN-DE Confusion Matrix of LLMB2PEN, curie model: Clockwise from top-left corner (TN, FP, TP, FN)
  • ...and 10 more figures