Quality Estimation Reranking for Document-Level Translation

Krzysztof Mrozinski, Minji Kang, Ahmed Khota, Vincent Michael Sutanto, Giovanni Gatti De Giacomo

TL;DR

The paper studies quality-estimation (QE) reranking for document-level machine translation (MT): each candidate in a pool of generated translations is scored, and the highest-scoring one is selected. It evaluates learned QE metrics (COMET-based metrics and SLIDE) and large language model (LLM)-based QE metrics (GEMBA-DA, EAPrompt) on both decoder-only LLMs and encoder-decoder MT models, using document-level adaptations such as windowed scoring and full-document evaluation. QE reranking yields substantial gains, with SLIDE and GEMBA-DA often performing best, and the gains grow with pool size. Gains diminish for very long documents because of token-length limits. The paper also notes complexity advantages over traditional MBR decoding. In practice, QE reranking adds little runtime overhead when the hardware can handle larger candidate pools, and the study highlights the importance of document-level scoring and robust prompting strategies for LLM-based QE.

Abstract

Quality estimation (QE) reranking is a form of quality-aware decoding that aims to improve machine translation (MT) by scoring and selecting the best candidate from a pool of generated translations. While known to be effective at the sentence level, its application to the increasingly prominent domain of document-level translation remains underexplored. In this work, we evaluate QE reranking performance on document-level (rather than the typical sentence-level) translation, using various learned and large language model (LLM)-based QE metrics. We find that with our best learned metric, SLIDE, BLEURT-20 scores improve by +2.00 with only two candidates, and by +5.09 with 32, across both decoder-only LLMs and encoder-decoder neural machine translation (NMT) models. Using the best LLM-based metric, GEMBA-DA, gains of +1.63 and +4.30 are achieved under the same conditions. Although gains shrink with longer inputs, reranking with 32 candidates yields improvements of +2.34 (SLIDE) and +1.40 (GEMBA-DA) on our longest documents (512-1024 source tokens). These findings demonstrate the practical value of document-level QE, with minimal runtime overhead given suitable translation models and hardware.
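
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `qe_rerank` and `toy_qe_score` are hypothetical names, and the toy scorer (a word-count mismatch penalty) only stands in for a real reference-free metric such as SLIDE or GEMBA-DA, whose interfaces are not shown in this summary.

```python
from typing import Callable, Sequence


def qe_rerank(
    source: str,
    candidates: Sequence[str],
    qe_score: Callable[[str, str], float],
) -> str:
    """Return the candidate translation with the highest QE score.

    qe_score(source, hypothesis) is assumed to be reference-free:
    it sees only the source document and one candidate translation.
    """
    if not candidates:
        raise ValueError("candidate pool is empty")
    # One scorer call per candidate; ties keep the earliest candidate.
    return max(candidates, key=lambda hyp: qe_score(source, hyp))


def toy_qe_score(source: str, hypothesis: str) -> float:
    # Hypothetical stand-in for a learned or LLM-based QE metric:
    # penalise the difference in whitespace-separated word counts.
    return -float(abs(len(source.split()) - len(hypothesis.split())))


if __name__ == "__main__":
    pool = ["x y", "x y z w", "x"]
    print(qe_rerank("a b c d", pool, toy_qe_score))
```

In this sketch the scorer is called once per candidate, so scoring cost grows linearly with pool size. MBR decoding with a pairwise utility would instead compare candidates against each other, which is a likely source of the complexity advantage the summary mentions.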

Paper Structure

This paper contains 14 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: BLEURT-20 scores for QE reranking across different pool sizes, evaluated with all QE metrics and translation models. A pool size of 1 serves as the baseline (no reranking). Scores generally increase with larger pools under most QE metrics, for all translators.
  • Figure 2: Distribution of source token and source sentence counts across our WMT23 dataset. The average source text is 4.30 sentences and 138 tokens long.
  • Figure 3: QE reranking performance for all QE metrics at pool size 32, averaged across all translator models. Gains diminish with longer documents but remain above the baseline (pool size 1) for most metrics.
  • Figure 4: Runtime by source length and pool size for all QE metrics and translators. Translation runtime rises steeply for models not trained at the document level, while QE runtime remains a small fraction of the overall runtime.