Is Human Annotation Necessary? Iterative MBR Distillation for Error Span Detection in Machine Translation

Boxuan Lyu; Haiyue Song; Zhi Qu

Abstract

Error Span Detection (ESD) is a crucial subtask in Machine Translation (MT) evaluation that aims to identify the location and severity of translation errors. While fine-tuning models on human-annotated data improves ESD performance, acquiring such data is expensive and prone to inconsistencies among annotators. To address this, we propose a novel self-evolution framework based on Minimum Bayes Risk (MBR) decoding, named Iterative MBR Distillation for ESD, which eliminates the reliance on human annotations by leveraging an off-the-shelf LLM to generate pseudo-labels. Extensive experiments on the WMT Metrics Shared Task datasets demonstrate that models trained solely on these self-generated pseudo-labels outperform both the unadapted base model and supervised baselines trained on human annotations at the system and span levels, while maintaining competitive sentence-level performance.

Paper Structure (25 sections, 9 equations, 1 figure, 3 tables, 1 algorithm)

Introduction
Related Work
MT Automatic Metrics
Self-Evolution of LLMs
Preliminaries
Error Span Detection Model
MBR Decoding for ESD
ESD Model Training
Supervised Fine-Tuning (SFT)
Direct Preference Optimization (DPO)
Kahneman-Tversky Optimization (KTO)
Proposed Method: Iterative MBR Distillation for ESD
Experiments
Experimental Setup
Datasets and Model
...and 10 more sections

Figures (1)

Figure 1: Overview of the Iterative MBR Distillation framework for ESD. Starting with unlabeled source-translation pairs, the model generates diverse candidate error spans. MBR decoding then evaluates these candidates to assign utility scores, identifying high-quality pseudo-labels (e.g., the best and worst hypotheses). Finally, the model is fine-tuned on these self-generated labels using SFT, DPO, or KTO. This cycle repeats iteratively, enabling the model to self-evolve and refine its ESD capabilities without relying on human annotations.
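
The candidate-scoring step described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each candidate is represented as a list of character-offset error spans, severity is ignored, and the utility is a simple character-overlap F1 chosen here for illustration (the utility actually used in the paper is not given in this section). The names `char_f1`, `mbr_scores`, and `pick_best_and_worst` are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def covered(spans: List[Span]) -> set:
    """Return the set of character positions covered by any span."""
    return {i for start, end in spans for i in range(start, end)}


def char_f1(hyp: List[Span], ref: List[Span]) -> float:
    """Illustrative utility (an assumption): character-overlap F1."""
    h, r = covered(hyp), covered(ref)
    if not h and not r:
        return 1.0  # both annotations say "no errors": full agreement
    overlap = len(h & r)
    if overlap == 0:
        return 0.0
    precision = overlap / len(h)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def mbr_scores(candidates: List[List[Span]]) -> List[float]:
    """Expected utility of each candidate, using the others as pseudo-references."""
    if len(candidates) < 2:
        raise ValueError("MBR scoring needs at least two candidates")
    scores = []
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        scores.append(sum(char_f1(hyp, ref) for ref in others) / len(others))
    return scores


def pick_best_and_worst(candidates: List[List[Span]]) -> Tuple[int, int]:
    """Indices of the highest- and lowest-utility candidates."""
    scores = mbr_scores(candidates)
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return best, worst


# Four sampled annotations for one translation (made-up offsets).
candidates = [[(0, 5)], [(0, 5)], [(0, 4)], [(10, 12)]]
best, worst = pick_best_and_worst(candidates)
```

In this sketch the best-scoring candidate could serve as an SFT target, while the best and worst candidates together could form a preference pair for DPO-style training, mirroring the roles the caption assigns to the pseudo-labels.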
