Chasing COMET: Leveraging Minimum Bayes Risk Decoding for Self-Improving Machine Translation

Kamil Guttmann; Mikołaj Pokrywka; Adrian Charkiewicz; Artur Nowakowski

TL;DR

This work demonstrates that Minimum Bayes Risk (MBR) decoding guided by neural quality metrics (COMET and AfriCOMET) can drive self-improvement in neural MT in both domain-adaptation and low-resource settings. The authors generate synthetic parallel data from monolingual source text, fine-tune on the MBR-selected forward translations, and report consistent translation-quality gains for English–German (biomedical), Czech–Ukrainian (low-resource), and English–Hausa (low-resource). They show that a beam-search-based MBR pipeline with around 50 candidates offers a good trade-off between performance and efficiency, and that iterating the self-improvement step can yield further gains but risks overfitting to the utility metric. The results underscore the practicality of COMET-guided MBR for domain-specific and low-resource MT improvement and highlight the potential benefits of language-specific evaluation metrics in low-resource scenarios.

Abstract

This paper explores Minimum Bayes Risk (MBR) decoding for self-improvement in machine translation (MT), particularly for domain adaptation and low-resource languages. We implement the self-improvement process by fine-tuning the model on its MBR-decoded forward translations. By employing COMET as the MBR utility metric, we aim to achieve the reranking of translations that better aligns with human preferences. The paper explores the iterative application of this approach and the potential need for language-specific MBR utility metrics. The results demonstrate significant enhancements in translation quality for all examined language pairs, including successful application to domain-adapted models and generalisation to low-resource settings. This highlights the potential of COMET-guided MBR for efficient MT self-improvement in various scenarios.
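The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the utility here is a toy character-trigram overlap standing in for COMET (which additionally conditions on the source sentence), and the names `toy_utility`, `mbr_select`, `build_synthetic_pairs`, and the `generate` callback are all hypothetical. Each candidate is scored by its average utility against the other candidates, treated as pseudo-references, and the highest-scoring one is kept as the synthetic target.

```python
from typing import Callable, List, Sequence, Set, Tuple


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def toy_utility(hypothesis: str, reference: str) -> float:
    """Stand-in utility (assumption): Jaccard overlap of trigrams, NOT COMET."""
    h, r = char_ngrams(hypothesis), char_ngrams(reference)
    if not h and not r:
        return 1.0
    return len(h & r) / len(h | r)


def mbr_select(
    candidates: Sequence[str],
    utility: Callable[[str, str], float] = toy_utility,
) -> str:
    """Return the candidate with the highest mean utility against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best, best_score = candidates[0], float("-inf")
    for i, hyp in enumerate(candidates):
        # All other candidates act as pseudo-references for this hypothesis.
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / len(others) if others else 0.0
        if score > best_score:  # strict '>' keeps the earliest candidate on ties
            best, best_score = hyp, score
    return best


def build_synthetic_pairs(
    sources: Sequence[str],
    generate: Callable[[str], List[str]],
    utility: Callable[[str, str], float] = toy_utility,
) -> List[Tuple[str, str]]:
    """Pair each monolingual source with its MBR-selected forward translation.

    `generate` is a hypothetical callback returning the model's candidate list
    (e.g. an n-best list from beam search).
    """
    return [(src, mbr_select(generate(src), utility)) for src in sources]
```

The resulting `(source, translation)` pairs would then serve as fine-tuning data. Note that scoring every candidate against every other is quadratic in the number of candidates, which is one reason the candidate count matters for efficiency.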

Paper Structure (24 sections, 12 figures, 18 tables)

Introduction
Related Work
MBR and QE reranking with neural metrics
Model self-improvement
Experiment Overview
Model Self-Improvement
English–German
Czech–Ukrainian
English–Hausa
Iterative MBR Self-Improvement
Experimental Setup
Data Filtering
Vocabulary
Baseline Model Hyperparameters
Evaluation metrics
...and 9 more sections

Figures (12)

Figure 1: Comparison of beam search and top-k algorithms of the Mix-tune English–German model for the khresmoi test set. Top-k algorithm with temperature 1.0 showed superior performance on neural metrics over top-k with temperature 0.1 and slightly better performance than beam search. However, beam search achieved the highest score on the chrF metric, while the top-k algorithm with temperature 1.0 had the lowest score (translation without MBR decoding is represented on the chart as the number of translation candidates equal to 0).
Figure 2: Comparison of beam search and top-k algorithms of the baseline Czech–Ukrainian model for the FLORES-200 test set. Beam search seems to be the superior option with the best performance on chrF and BLEURT metrics and slightly worse results on COMET over top-k with temperature 1.0 (translation without MBR decoding is represented on the chart as the number of translation candidates equal to 0).
Figure 3: Comparison of beam search performance with a different number of samples of the Mix-tune English–German model for the khresmoi test set. Initial increases in the number of samples for MBR decoding showed very rapid gains, but further increases no longer resulted in such large gains, and performance on the n-gram metrics deteriorated (translation without MBR decoding is represented on the chart as the number of translation candidates equal to 0).
Figure 4: Comparison of beam search and top-k algorithms of the Mix-tune English–German model for the khresmoi test set. Top-k algorithm with temperature 1.0 showed superior performance on neural metrics over top-k with temperature 0.1 and slightly better performance than beam search. However, beam search achieved the highest score on the chrF metric, while the top-k algorithm with temperature 1.0 had the lowest score for lexical metrics (translation without MBR decoding is represented on the chart as the number of translation candidates equal to 0).
Figure 5: Comparison of beam search and top-k algorithms of the baseline Czech–Ukrainian model for the FLORES-200 test set. Beam search seems to be the superior option with the best performance on every metric except COMET (translation without MBR decoding is represented on the chart as the number of translation candidates equal to 0).
...and 7 more figures
