TEaR: Improving LLM-based Machine Translation with Systematic Self-Refinement

Zhaopeng Feng, Yan Zhang, Hao Li, Bei Wu, Jiayu Liao, Wenqiang Liu, Jun Lang, Yang Feng, Jian Wu, Zuozhu Liu

TL;DR

This paper tackles persistent errors in LLM-based translation by introducing TEaR, a systematic Translate-Estimate-Refine framework that leverages an explicit estimation module to guide self-refinement within a single LLM. The Translate module generates an initial translation from few-shot prompts, the Estimate module provides human-like, error-focused feedback via MQM-inspired guidelines, and the Refine module uses this feedback to produce corrected translations. Across 17 translation directions and multiple language pairs, TEaR yields consistent improvements over baselines in both automatic metrics (BLEU, COMET, BLEURT, Kiwi) and human preferences, and its analysis reveals that the quality of the estimation step is a critical determinant of success. The work also explores cross-model corrections, showing correlations between translation and evaluation capabilities in general-purpose LLMs and highlighting opportunities for further optimization and model sharpening. The framework is demonstrated with GPT-3.5-turbo, Claude-2, and Gemini-Pro, and the authors provide code and data to support reproducibility and further research.
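The three-step loop described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the prompt templates, the `tear` function, the `llm` callable (any function mapping a prompt string to a completion string), and the `no-error` sentinel are all hypothetical names introduced here. Skipping refinement when the estimate reports no errors is also an assumption of this sketch.

```python
from typing import Callable, Dict

# Hypothetical prompt templates; the paper's actual prompts are not reproduced here.
TRANSLATE_PROMPT = (
    "Translate the following text from {src_lang} to {tgt_lang}.\n"
    "{examples}\nSource: {source}\nTranslation:"
)
ESTIMATE_PROMPT = (
    "Assess the translation below from {src_lang} to {tgt_lang}. "
    "List each error with its category and severity (MQM-style), "
    "or answer 'no-error' if there is none.\n"
    "Source: {source}\nTranslation: {draft}\nErrors:"
)
REFINE_PROMPT = (
    "Improve the translation from {src_lang} to {tgt_lang} using the feedback.\n"
    "Source: {source}\nTranslation: {draft}\nFeedback: {feedback}\n"
    "Improved translation:"
)

NO_ERROR = "no-error"  # assumed sentinel meaning the estimate found nothing to fix


def tear(
    source: str,
    src_lang: str,
    tgt_lang: str,
    llm: Callable[[str], str],
    examples: str = "",
) -> Dict[str, str]:
    """Run one Translate -> Estimate -> Refine pass with a single LLM callable."""
    # Translate: produce the initial translation (few-shot examples are optional).
    draft = llm(
        TRANSLATE_PROMPT.format(
            src_lang=src_lang, tgt_lang=tgt_lang, examples=examples, source=source
        )
    ).strip()

    # Estimate: ask the same model for error-focused feedback on its own draft.
    feedback = llm(
        ESTIMATE_PROMPT.format(
            src_lang=src_lang, tgt_lang=tgt_lang, source=source, draft=draft
        )
    ).strip()

    # Assumption of this sketch: keep the draft when no errors are reported.
    if feedback.lower() == NO_ERROR:
        return {"initial": draft, "feedback": feedback, "final": draft}

    # Refine: rewrite the draft, conditioned on the feedback.
    refined = llm(
        REFINE_PROMPT.format(
            src_lang=src_lang,
            tgt_lang=tgt_lang,
            source=source,
            draft=draft,
            feedback=feedback,
        )
    ).strip()
    return {"initial": draft, "feedback": feedback, "final": refined}
```

Because `llm` is an ordinary callable, the same function can wrap any chat API client, and a cross-model variant could be obtained by passing a different callable for the Estimate step.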

Abstract

Large Language Models (LLMs) have achieved impressive results in Machine Translation (MT). However, careful evaluations by humans reveal that the translations produced by LLMs still contain multiple errors. Importantly, feeding back such error information into the LLMs can lead to self-refinement and result in improved translation performance. Motivated by these insights, we introduce a systematic LLM-based self-refinement translation framework, named TEaR, which stands for Translate, Estimate, and Refine, marking a significant step forward in this direction. Our findings demonstrate that 1) our self-refinement framework successfully assists LLMs in improving their translation quality across a wide range of languages, whether translating from high-resource languages to low-resource ones, and whether English-centric or centered around other languages; 2) TEaR exhibits superior systematicity and interpretability; 3) different estimation strategies yield varied impacts, directly affecting the effectiveness of the final corrections. Additionally, traditional neural translation models and evaluation models operate separately, often focusing on singular tasks due to their limited capabilities, while general-purpose LLMs possess the capability to undertake both tasks simultaneously. We further conduct cross-model correction experiments to investigate the potential relationship between the translation and evaluation capabilities of general-purpose LLMs. Our code and data are available at https://github.com/fzp0424/self_correct_mt

Paper Structure

This paper contains 29 sections, 1 equation, 8 figures, 27 tables.

Figures (8)

  • Figure 1: The original translation is from the GPT-4 submission to WMT23 (kocmi2023findings). The MQM error labels were annotated by human experts. We use gpt-4 via the OpenAI API to correct the translation. The score under the COMET-22 model (wmt22-comet-da; rei2020comet) increases from 83.29 to 84.22.
  • Figure 2: The TEaR framework comprises three steps: Translate, Estimate, and Refine. Each step is executed with its own prompt ($\mathcal{T}_{translate}$, $\mathcal{T}_{estimate}$, $\mathcal{T}_{refine}$). The prompting strategies are detailed in the paper's section on prompts.
  • Figure 3: Results for 16 translation directions using GPT-3.5-turbo. IT: the initial translation using a few-shot prompt; SCoT: Structured Chain-of-Thought (raunak2023leveraging); CT: contrastive translation, performed by inserting the word "bad" (chen2023iterative). TowerInstruct-13B: the state-of-the-art open-source APE model, which has already been trained on WMT21 and WMT22 data (alves2024tower).
  • Figure 4: Results of the human preference study, comparing TEaR with IT, SCoT, and CT. The data for the first row of subfigures come from WMT22, tested on GPT-3.5-turbo, while the experiments for the second row of subfigures were conducted on our WMT23 Zh-En dataset using three models (GPT-3.5-turbo, Gemini-Pro, Claude-2).
  • Figure 5: COMET scores when using various feedback estimation strategies in TEaR. "-" denotes the initial translation (IT). Zero-shot and few-shot denote different prompting methods with GPT-3.5-turbo, while GPT-4 w/ human denotes estimations made using GPT-4 with human assistance.
  • ...and 3 more figures