Prompting Large Language Models with Human Error Markings for Self-Correcting Machine Translation

Nathaniel Berger, Stefan Riezler, Miriam Exel, Matthias Huck

TL;DR

This work introduces a light-weight prompting approach for domain-specific machine translation that augments post-editing translation memories (PE-TM) with token-level error markings. At test time, a user marks errors in a translation, and a few similar in-context examples retrieved from the error-annotated PE-TM guide an LLM to focus its corrections on the marked tokens. Experiments on IT-domain English–German data with Llama 13B and GPT-3.5 show that error markings significantly increase targeted edits and improve BLEU and TER scores compared to MT and automatic post-editing (APE), and human evaluation indicates that a majority of the edits made with error markings (MRK) are correct. The study suggests a practical, interactive feedback loop for steering LLMs toward focused self-corrections and points to future work on learned error-marking models and larger translation memories.

Abstract

While large language models (LLMs) pre-trained on massive amounts of unpaired language data have reached the state-of-the-art in machine translation (MT) of general domain texts, post-editing (PE) is still required to correct errors and to enhance term translation quality in specialized domains. In this paper we present a pilot study of enhancing translation memories (TM) produced by PE (source segments, machine translations, and reference translations, henceforth called PE-TM) for the needs of correct and consistent term translation in technical domains. We investigate a light-weight two-step scenario where, at inference time, a human translator marks errors in the first translation step, and in a second step a few similar examples are extracted from the PE-TM to prompt an LLM. Our experiment shows that the additional effort of augmenting translations with human error markings guides the LLM to focus on a correction of the marked errors, yielding consistent improvements over automatic PE (APE) and MT from scratch.
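
The two-step scenario described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `mark_errors` and `retrieve_similar`, the dictionary key `"source"`, the merging of adjacent marked tokens into one tag span, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for this example. The abstract does not say how similar examples are selected.

```python
from difflib import SequenceMatcher
from itertools import groupby


def mark_errors(tokens, is_bad):
    """Wrap each maximal run of marked tokens in <bad></bad> tags.

    `tokens` is a list of strings and `is_bad` a parallel list of booleans
    (the human error markings). Merging adjacent marked tokens into a single
    tag span is an assumption of this sketch.
    """
    pieces = []
    for bad, run in groupby(zip(tokens, is_bad), key=lambda pair: pair[1]):
        text = " ".join(tok for tok, _ in run)
        pieces.append(f"<bad>{text}</bad>" if bad else text)
    return " ".join(pieces)


def retrieve_similar(source, pe_tm, k=5):
    """Return the k PE-TM entries whose source segment is closest to `source`.

    Character-level SequenceMatcher ratio is a stand-in similarity measure
    (assumption); any retrieval method could be substituted here.
    """
    def score(entry):
        return SequenceMatcher(None, source.lower(), entry["source"].lower()).ratio()

    return sorted(pe_tm, key=score, reverse=True)[:k]
```

The retrieved entries would then serve as in-context demonstrations, and the marked hypothesis as the test input, when prompting the LLM in the second step.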

Paper Structure

This paper contains 14 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Example of a 1-shot prompt for English-to-German translation. Error markings are inside bold-faced tags <bad></bad>. The demonstration example consists of a source segment in English (in green), a translation hypothesis in German (in blue), and a correction (in red). The test example shows a correction of the translation of "environment variable" from "Umweltvariable" into "Umgebungsvariable" learned by the LLM (in bold-faced red).
  • Figure 2: Instructions given to annotators on how to mark errors in sentences, including how to use the interface and the desired marking behavior.
  • Figure 3: Example of a 5-shot prompt for English-to-German translation (MT). Each demonstration example consists of a source segment in English (in green) and a reference translation (in red).
  • Figure 4: Example of a 5-shot prompt for English-to-German automatic post-editing (APE). Each demonstration example consists of a source segment in English (in green), a translation hypothesis in German (in blue), and a reference translation (in red).
  • Figure 5: Example of a 5-shot prompt for English-to-German post-editing with error markings (MRK). Error markings are inside <bad></bad> tags. Each demonstration example consists of a source segment in English (in green), a translation hypothesis in German (in blue), and a reference translation (in red).
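
The three prompt types in Figures 3 to 5 differ only in which fields each demonstration contains. The Python sketch below makes that difference explicit. It is a hypothetical reconstruction: the function name `build_prompt`, the field labels `English:`, `German:`, and `Corrected:`, and the dictionary keys are assumptions for illustration, since the captions do not give the exact prompt template.

```python
def build_prompt(demos, test, mode="MRK"):
    """Assemble a few-shot prompt in one of three modes.

    MT:  source -> reference translation
    APE: source + hypothesis -> reference translation
    MRK: source + hypothesis with <bad></bad> markings -> reference translation

    Each example is a dict with the (hypothetical) keys "source",
    "hypothesis", "marked", and "reference". The test example needs no
    "reference", because its answer slot is left open for the LLM.
    """
    if mode not in ("MT", "APE", "MRK"):
        raise ValueError(f"unknown mode: {mode}")

    def block(ex, with_answer):
        lines = ["English: " + ex["source"]]
        if mode == "APE":
            lines.append("German: " + ex["hypothesis"])
        elif mode == "MRK":
            lines.append("German: " + ex["marked"])
        answer_label = "German:" if mode == "MT" else "Corrected:"
        lines.append(answer_label + (" " + ex["reference"] if with_answer else ""))
        return "\n".join(lines)

    blocks = [block(d, True) for d in demos] + [block(test, False)]
    return "\n\n".join(blocks)


# Made-up example in the spirit of Figure 1 (not taken from the paper's data).
demo = {
    "source": "Set the environment variable.",
    "hypothesis": "Setzen Sie die Umweltvariable.",
    "marked": "Setzen Sie die <bad>Umweltvariable</bad>.",
    "reference": "Setzen Sie die Umgebungsvariable.",
}
test_example = {
    "source": "Delete the environment variable.",
    "hypothesis": "Löschen Sie die Umweltvariable.",
    "marked": "Löschen Sie die <bad>Umweltvariable</bad>.",
}
mrk_prompt = build_prompt([demo], test_example, mode="MRK")
```

Only the MRK mode exposes the error markings to the model; the MT mode omits the hypothesis entirely, and the APE mode shows it without tags.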