A Preference-driven Paradigm for Enhanced Translation with Large Language Models

Dawei Zhu, Sony Trenous, Xiaoyu Shen, Dietrich Klakow, Bill Byrne, Eva Hasler

TL;DR

The paper tackles the SFT plateau in large-language-model translation caused by token-level imitation of noisy references. It introduces a preference-learning paradigm based on the Plackett-Luce model, augmented with a distance-aware loss to incorporate quantified quality differences between translations. A new dataset, MAPLE, provides five translations per source with real-valued human scores, enabling supervision beyond gold references. Empirical results show that preference learning on MAPLE consistently improves translation quality across multiple directions and LLMs, and the approach can reuse MAPLE data to enhance other models, offering a practical pathway to break the SFT plateau and calibrate model preferences toward human judgments.

Abstract

Recent research has shown that large language models (LLMs) can achieve remarkable translation performance through supervised fine-tuning (SFT) using only a small amount of parallel data. However, SFT simply instructs the model to imitate the reference translations at the token level, making it vulnerable to the noise present in the references. Hence, the assistance from SFT often reaches a plateau once the LLMs have achieved a certain level of translation capability, and further increasing the size of parallel data does not provide additional benefits. To overcome this plateau associated with imitation-based SFT, we propose a preference-based approach built upon the Plackett-Luce model. The objective is to steer LLMs towards a more nuanced understanding of translation preferences from a holistic view, while also being more resilient in the absence of gold translations. We further build a dataset named MAPLE to verify the effectiveness of our approach, which includes multiple translations of varying quality for each source sentence. Extensive experiments demonstrate the superiority of our approach in "breaking the plateau" across diverse LLMs and test settings. Our in-depth analysis underscores the pivotal role of diverse translations and accurate preference scores in the success of our approach.
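
To make the ranking idea concrete, here is a minimal sketch of a Plackett-Luce negative log-likelihood over several candidate translations of one source sentence. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, `model_scores` is assumed to hold the model's log-probability for each candidate, and `human_scores` is assumed to hold the corresponding human quality scores. The distance-aware term mentioned in the TL;DR is not reproduced, because this section does not give its form.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def plackett_luce_nll(model_scores, human_scores):
    """Negative log-likelihood of the human ranking under a Plackett-Luce model.

    model_scores: one score per candidate translation (assumed here to be
                  sequence log-probabilities from the model).
    human_scores: one human quality score per candidate; higher is better.
    """
    # Order candidates from best to worst according to the human scores.
    order = sorted(range(len(model_scores)),
                   key=lambda i: human_scores[i], reverse=True)
    ranked = [model_scores[i] for i in order]

    # At each position k, the candidate ranked k-th must "win" a softmax
    # over all candidates not yet placed (positions k, k+1, ..., K-1).
    nll = 0.0
    for k in range(len(ranked)):
        nll -= ranked[k] - logsumexp(ranked[k:])
    return nll


# Hypothetical example: three candidates for one source sentence.
agree = plackett_luce_nll([-1.0, -2.0, -3.0], [90.0, 70.0, 50.0])
disagree = plackett_luce_nll([-1.0, -2.0, -3.0], [50.0, 70.0, 90.0])
print(agree < disagree)
```

The loss is smaller when the model's scores order the candidates the same way the human scores do, which is the property a preference-based objective relies on; token-level SFT, by contrast, only raises the likelihood of a single reference.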

Paper Structure (45 sections, 13 equations, 5 figures, 14 tables)

Figures (5)

  • Figure 1: Human score distribution of translations by rank (left) and source (right).
  • Figure 2: Performance comparison between PL using 4.4K examples from MAPLE and SFT, employing up to 1.4M parallel data. Evaluation is done on WMT22, and COMET scores are averaged across four translation directions. Performing SFT on more parallel data does not always lead to performance gain. PL consistently outperforms SFT in all cases.
  • Figure 3: Model performance when varying the number of translations ($K$) per source sentence. Evaluation is conducted on WMT22, and COMET scores averaged across four translation directions are reported. Reverse mode selects more diverse translations and achieves better performance, especially when fewer translations are provided.
  • Figure 4: Sentence-level correlation between model generation probability and human preference scores when varying the number of translations ($K$). PL helps the model align better with human judgement.
  • Figure 5: User interface of translation assessment.