Beyond Human-Only: Evaluating Human-Machine Collaboration for Collecting High-Quality Translation Data

Zhongtao Liu, Parker Riley, Daniel Deutsch, Alison Lui, Mengmeng Niu, Apu Shah, Markus Freitag

TL;DR

Human-machine collaboration can match or even exceed the quality of human-only translations while being more cost-efficient. Error analysis reveals complementary strengths between human and machine contributions, highlighting the effectiveness of collaborative methods.

Abstract

Collecting high-quality translations is crucial for the development and evaluation of machine translation systems. However, traditional human-only approaches are costly and slow. This study presents a comprehensive investigation of 11 approaches for acquiring translation data, including human-only, machine-only, and hybrid approaches. Our findings demonstrate that human-machine collaboration can match or even exceed the quality of human-only translations, while being more cost-efficient. Error analysis reveals the complementary strengths between human and machine contributions, highlighting the effectiveness of collaborative methods. Cost analysis further demonstrates the economic benefits of human-machine collaboration methods, with some approaches achieving top-tier quality at around 60% of the cost of traditional methods. We release a publicly available dataset containing nearly 18,000 segments of varying translation quality with corresponding human ratings to facilitate future research.

Paper Structure

This paper contains 21 sections, 11 figures, and 9 tables.

Figures (11)

  • Figure 1: Our 11 translation systems, organized by initial translation type (human or machine) and post-editing type (none, human, or machine). Detailed system descriptions are provided in Section \ref{sec:system_description}.
  • Figure 2: Cross-BLEU scores for different EnDe translation collection approaches.
  • Figure 3: MQM scores for different translation systems across two language pairs: Chinese-English and English-German. Bars represent the average MQM score for each translation system. The systems are grouped and colored by initial translation and further categorized by post-editing method with different fill patterns. Lower MQM scores indicate better quality.
  • Figure 4: Agreement between HumanPE and LLMRefine in identifying segments requiring post-editing on English-German data. Each pie chart represents a different initial translation source.
  • Figure 5: Percentage changes in errors for different post-editing approaches on English-German data. The percentages show the change in error counts for each post-editing method compared to its initial translation. A negative value indicates a decrease in that error type, while a positive value indicates an increase.
  • ...and 6 more figures