Efficient Machine Translation Corpus Generation: Integrating Human-in-the-Loop Post-Editing with Large Language Models
Kamer Ali Yuksel, Ahmet Gunduz, Abdul Baseet Anees, Hassan Sawaf
TL;DR
The paper tackles the challenge of efficiently generating high-quality MT corpora by combining semi-automated post-editing with optional large language model (LLM) enhancements in a human-in-the-loop workflow. It introduces an online, production-oriented pipeline that uses LLMs for translation synthesis, annotation analysis, pseudo-labeling, and best-hypothesis recommendation, all within an AutoML-driven quality estimation loop based on FLAML. Key contributions include four LLM-enabled capabilities (translation synthesis from ensemble MT outputs, comprehensive annotation analysis, LLM-driven pseudo-labeling to expand the corpus, and LLM-suggested best translations), integrated with production training and customization for MT vendors. Empirical results show a 4.33% average quality improvement (SD 10.25%), a moderate correlation with COMET-QE (Spearman 0.40), and competitive top-k model-selection accuracy (Top-1 24%, Top-3 57%), illustrating practical benefits in efficiency and scalability for real-world MT deployment. The approach promises reduced annotator workload and scalable data augmentation, enabling customized, production-ready MT pipelines across languages and domains.
Abstract
This paper introduces an advanced methodology for machine translation (MT) corpus generation, integrating semi-automated, human-in-the-loop post-editing with large language models (LLMs) to enhance efficiency and translation quality. Building upon previous work that utilized real-time training of a custom MT quality estimation metric, this system incorporates novel LLM features such as Enhanced Translation Synthesis and Assisted Annotation Analysis, which improve initial translation hypotheses and quality assessments, respectively. Additionally, the system employs LLM-Driven Pseudo Labeling and a Translation Recommendation System to reduce human annotator workload in specific contexts. These improvements not only retain the original benefits of cost reduction and enhanced post-edit quality but also open new avenues for leveraging cutting-edge LLM advancements. The project's source code is available for community use, promoting collaborative developments in the field. A demo video is also available.
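As a rough illustration of how pseudo-labeling could reduce annotator workload in a human-in-the-loop setup, the sketch below routes each segment either to automatic acceptance or to a human post-editor. It is a hypothetical design, not the system described in the paper: the `Segment` type, the `qe_score` callable, and the fixed acceptance threshold are all assumptions, and the abstract does not state how the actual system decides when a pseudo-label is acceptable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Segment:
    """A source sentence with candidate translations (hypothetical type)."""
    source: str
    hypotheses: List[str]


def route_segments(
    segments: List[Segment],
    qe_score: Callable[[str, str], float],
    accept_threshold: float,
) -> Tuple[List[Tuple[Segment, str]], List[Segment]]:
    """Split segments into (pseudo_labeled, needs_human).

    A segment is pseudo-labeled with its best-scoring hypothesis when that
    score reaches accept_threshold; otherwise it goes to a human annotator.
    The threshold rule is an assumption made for this sketch.
    """
    pseudo_labeled: List[Tuple[Segment, str]] = []
    needs_human: List[Segment] = []
    for segment in segments:
        if not segment.hypotheses:
            # Nothing to recommend, so a human must translate from scratch.
            needs_human.append(segment)
            continue
        scored = [(qe_score(segment.source, h), h) for h in segment.hypotheses]
        best_score, best_hypothesis = max(scored, key=lambda pair: pair[0])
        if best_score >= accept_threshold:
            pseudo_labeled.append((segment, best_hypothesis))
        else:
            needs_human.append(segment)
    return pseudo_labeled, needs_human
```

In a setup like this, the threshold trades annotator time against the risk of admitting low-quality pseudo-labels into the corpus, so it would need to be tuned on segments that humans have already post-edited.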
