TEaR: Improving LLM-based Machine Translation with Systematic Self-Refinement
Zhaopeng Feng, Yan Zhang, Hao Li, Bei Wu, Jiayu Liao, Wenqiang Liu, Jun Lang, Yang Feng, Jian Wu, Zuozhu Liu
TL;DR
This paper tackles persistent errors in LLM-based translation by introducing TEaR, a systematic Translate-Estimate-Refine framework that leverages an explicit estimation module to guide self-refinement within a single LLM. The Translate module generates an initial translation from few-shot prompts, the Estimate module provides human-like, error-focused feedback via MQM-inspired guidelines, and the Refine module uses this feedback to produce corrected translations. Across 17 translation directions and multiple language pairs, TEaR yields consistent improvements over baselines in both automatic metrics (BLEU, COMET, BLEURT, Kiwi) and human preferences, and its analysis reveals that the quality of the estimation step is a critical determinant of success. The work also explores cross-model corrections, showing correlations between translation and evaluation capabilities in general-purpose LLMs and highlighting opportunities for further optimization and model sharpening. The framework is demonstrated with GPT-3.5-turbo, Claude-2, and Gemini-Pro, and the authors provide code and data to support reproducibility and further research.
Abstract
Large Language Models (LLMs) have achieved impressive results in Machine Translation (MT). However, careful human evaluation reveals that the translations produced by LLMs still contain multiple errors. Importantly, feeding such error information back into the LLMs can lead to self-refinement and improved translation performance. Motivated by these insights, we introduce a systematic LLM-based self-refinement translation framework, named TEaR, which stands for Translate, Estimate, and Refine, marking a significant step forward in this direction. Our findings demonstrate that 1) our self-refinement framework successfully assists LLMs in improving their translation quality across a wide range of languages, from high-resource to low-resource ones, in both English-centric and non-English-centric directions; 2) TEaR exhibits superior systematicity and interpretability; 3) different estimation strategies yield varied impacts, directly affecting the effectiveness of the final corrections. Additionally, traditional neural translation models and evaluation models operate separately, each often focusing on a single task due to its limited capabilities, while general-purpose LLMs are capable of undertaking both tasks simultaneously. We further conduct cross-model correction experiments to investigate the potential relationship between the translation and evaluation capabilities of general-purpose LLMs. Our code and data are available at https://github.com/fzp0424/self_correct_mt
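
The Translate, Estimate, and Refine steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `llm` callable, the prompt wording, the `NO_ERROR` sentinel, and the choice to skip refinement when no errors are reported are all assumptions made for illustration. The actual prompts and MQM-inspired guidelines are in the repository linked above.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]

# Assumed sentinel the estimator returns when it finds nothing to fix.
NO_ERROR = "no-error"


def translate(llm: LLM, src: str, src_lang: str, tgt_lang: str,
              examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Translate step: few-shot prompt, then the source sentence."""
    shots = "".join(f"{src_lang}: {s}\n{tgt_lang}: {t}\n\n" for s, t in examples)
    prompt = f"{shots}{src_lang}: {src}\n{tgt_lang}:"
    return llm(prompt).strip()


def estimate(llm: LLM, src: str, hyp: str, src_lang: str, tgt_lang: str) -> str:
    """Estimate step: ask for error-focused feedback (illustrative wording)."""
    prompt = (
        f"Identify errors in the {tgt_lang} translation of the {src_lang} source. "
        "List each error with a category (e.g. accuracy, fluency) and a severity "
        f"(critical, major, minor), or answer '{NO_ERROR}'.\n"
        f"Source: {src}\nTranslation: {hyp}\nErrors:"
    )
    return llm(prompt).strip()


def refine(llm: LLM, src: str, hyp: str, feedback: str,
           src_lang: str, tgt_lang: str) -> str:
    """Refine step: rewrite the draft using the estimator's feedback."""
    prompt = (
        f"Improve the {tgt_lang} translation of the {src_lang} source "
        "by fixing the listed errors.\n"
        f"Source: {src}\nTranslation: {hyp}\nErrors: {feedback}\n"
        "Improved translation:"
    )
    return llm(prompt).strip()


def tear(llm: LLM, src: str, src_lang: str, tgt_lang: str,
         examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Run all three steps with the same model."""
    draft = translate(llm, src, src_lang, tgt_lang, examples)
    feedback = estimate(llm, src, draft, src_lang, tgt_lang)
    if feedback.lower() == NO_ERROR:
        # Assumption: keep the draft when the estimator reports no errors.
        return draft
    return refine(llm, src, draft, feedback, src_lang, tgt_lang)
```

Because the model is passed in as a plain callable, the same sketch could be used to try cross-model correction by supplying different callables to `estimate` and `refine` than to `translate`.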
