xTower: A Multilingual LLM for Explaining and Correcting Translation Errors

Marcos Treviso; Nuno M. Guerreiro; Sweta Agrawal; Ricardo Rei; José Pombal; Tania Vaz; Helena Wu; Beatriz Silva; Daan van Stigt; André F. T. Martins

TL;DR

xTower addresses the lack of interpretable feedback in MT outputs by using a multilingual LLM built on TowerBase to generate free-text explanations for translation errors and to propose corrected translations. It employs distillation data from GPT-4 to train on MQM-annotated spans, enabling both referenceless and reference-based prompting, and couples explanations with corrections via chain-of-thought prompting. Intrinsic human evaluations show explanations are generally related to the marked errors and helpful for understanding and potential correction, while extrinsic experiments demonstrate significant translation-quality gains, including a hybrid strategy that selects between original and corrected translations. The approach highlights the potential of error-focused explanations to improve MT interpretability and post-editing performance in a modular, scalable way across multiple language pairs.

Abstract

While machine translation (MT) systems are achieving increasingly strong performance on benchmarks, they often produce translations with errors and anomalies. Understanding these errors can potentially help improve the translation quality and user experience. This paper introduces xTower, an open large language model (LLM) built on top of TowerBase and designed to provide free-text explanations for translation errors in order to guide the generation of a corrected translation. The quality of the explanations generated by xTower is assessed via both intrinsic and extrinsic evaluation. Intrinsically, we ask expert translators to evaluate the quality of the explanations across two dimensions: relatedness to the error span being explained, and helpfulness in understanding the error and improving translation quality. Extrinsically, we test xTower across various experimental setups in generating translation corrections, demonstrating significant improvements in translation quality. Our findings highlight xTower's potential not only to produce plausible and helpful explanations of automatic translations, but also to leverage them to suggest corrected translations.
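
The hybrid strategy mentioned in the TL;DR, which selects between the original and the corrected translation, can be illustrated with a minimal sketch. The selection criterion is not specified in this summary, so the sketch assumes a referenceless quality scorer passed in as `score`; the function names and the toy scorer are hypothetical and are not taken from the paper.

```python
from typing import Callable


def select_translation(
    source: str,
    original: str,
    corrected: str,
    score: Callable[[str, str], float],
) -> str:
    """Return whichever translation the scorer rates higher.

    Assumption: `score(source, translation)` returns a quality estimate
    where higher is better. Ties keep the original translation.
    """
    if score(source, corrected) > score(source, original):
        return corrected
    return original


def toy_score(source: str, translation: str) -> float:
    """Toy stand-in for a real quality estimator (illustration only).

    It merely penalises a mismatch in word count between source and
    translation; a real setup would call a learned metric instead.
    """
    return -abs(len(source.split()) - len(translation.split()))
```

In practice the toy scorer would be replaced by a learned quality-estimation metric; the point of the sketch is only that the corrected translation is kept when it is judged better than the original.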

Paper Structure (61 sections, 1 equation, 7 figures, 12 tables)

Introduction
Background
Tower.
MT Evaluation.
xTower
Distillation
Data.
Prompt.
Finetuning
Explaining Translation Errors
Experimental Setup
Data.
Prompting.
Evaluation.
Relatedness
...and 46 more sections

Figures (7)

Figure 1: Illustration of our approach. In this example, the input consisting of a source and a translation is passed to xComet, which annotates the translation with error spans and produces a (discretized) quality score. The full input, marked translation, and quality score are passed to xTower, which, in turn, produces an explanation for each error span along with a final suggestion for a new, corrected translation.
Figure 2: Relatedness according to the number of spans for xComet and human error spans.
Figure 3: At the top, we show the quality of the original translation versus the corrected translation on en-de with human spans. At the bottom, we show how often the latter is higher than the former per quality bin.
Figure 4: Delta between Comet scores for corrected and original translations according to how related explanations are to error spans.
Figure 5: Screenshot of the relatedness task interface presented to annotators.
...and 2 more figures
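
The data flow described in the Figure 1 caption (error spans from xComet are marked in the translation before the input is passed to xTower) can be sketched as follows. The tag format, the `ErrorSpan` type, and the `mark_translation` helper are assumptions made for illustration; the actual marking scheme used by xTower is not given in this summary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ErrorSpan:
    start: int     # character offset where the span begins (inclusive)
    end: int       # character offset where the span ends (exclusive)
    severity: str  # e.g. "minor" or "major"


def mark_translation(translation: str, spans: List[ErrorSpan]) -> str:
    """Wrap each error span in numbered tags (hypothetical tag format).

    Assumes the spans do not overlap; they are processed left to right.
    """
    ordered = sorted(spans, key=lambda s: s.start)
    pieces = []
    cursor = 0
    for i, span in enumerate(ordered, start=1):
        pieces.append(translation[cursor:span.start])   # text before the span
        pieces.append(f'<error{i} severity="{span.severity}">')
        pieces.append(translation[span.start:span.end])  # the flagged text
        pieces.append(f"</error{i}>")
        cursor = span.end
    pieces.append(translation[cursor:])                  # trailing text
    return "".join(pieces)
```

A marked translation of this kind, together with the source and the quality score, would form the input for which the model produces one explanation per span and a corrected translation.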
