Word Alignment as Preference for Machine Translation

Qiyu Wu, Masaaki Nagata, Zhongtao Miao, Yoshimasa Tsuruoka

TL;DR

This work uses word alignment as a preference signal for optimizing an LLM-based MT model, guiding the model toward translations with better word alignment, and demonstrates that word alignment-based preference optimization mitigates hallucination and omission.

Abstract

Hallucination and omission, long-standing problems in machine translation (MT), are more pronounced when a large language model (LLM) is used for MT because an LLM is itself susceptible to these phenomena. In this work, we mitigate the problem in an LLM-based MT model by guiding it toward better word alignment. We first study the correlation between word alignment and hallucination and omission in MT. We then propose to use word alignment as a preference signal for optimizing the LLM-based MT model. The preference data are constructed by selecting chosen and rejected translations from multiple MT tools. Direct preference optimization is then used to optimize the LLM-based model toward the preference signal. Given the absence of evaluators specifically designed for hallucination and omission in MT, we further propose selecting hard instances and using GPT-4 to directly evaluate how well the models mitigate these issues. We verify the soundness of these evaluation methods experimentally, and then present extensive results demonstrating that word alignment-based preference optimization mitigates hallucination and omission. However, although the method shows promise in mitigating hallucination and omission, overall MT performance across language directions remains mixed, with slight increases in BLEU and decreases in COMET.

Paper Structure (32 sections, 2 equations, 7 figures, 6 tables)

This paper contains 32 sections, 2 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: A preliminary experiment shows that higher coverage scores correlate with less hallucination and omission. The coverage scores are predicted by a word aligner (wu-etal-2023-wspalign). The human annotations of hallucination and omission are from the HalOmi benchmark (dale-etal-2023-halomi). Details about the dataset and word alignment model can be found in §\ref{sec:datasets}.
  • Figure 2: An illustration of the WAP framework. The source is first translated by multiple MT tools, and a human translation is included as well. An external word aligner is then used to predict the coverage score of each translation. Finally, the translations with the highest and lowest coverage scores are selected as a preference pair for preference optimization.
  • Figure 3: The prompt for translating sentences.
  • Figure 4: Proportions of "chosen" and "rejected" translations in the preference pairs, broken down by source: ChatGPT, DeepL, and Human. "all" represents the overall proportion for the aggregated dataset. $xx \leftrightarrow en$ denotes the subset pairing English with another language. Google Translate is used for $is \leftrightarrow en$ in place of DeepL.
  • Figure 5: Comparison of WAP and the baseline on hard and easy instances. The $N$ test-set instances with the lowest baseline COMET scores are selected as hard instances, and the remaining instances are easy instances. Results for $N=100$, $200$, and $500$ are presented. Refer to §\ref{sec:all_results} for the full numeric results on the entire test set.
  • ...and 2 more figures
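
The selection steps described in the captions of Figure 2 (highest and lowest coverage as a preference pair) and Figure 5 (the N lowest-COMET instances as hard instances) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The coverage definition used here (fraction of source tokens with at least one aligned target token) is an assumption, since this page does not define it, and the function names, the alignment format, and the system names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical alignment format: a set of (source_index, target_index) pairs,
# as an external word aligner might return them.
Alignment = Set[Tuple[int, int]]


def coverage_score(num_source_tokens: int, alignment: Alignment) -> float:
    """ASSUMED definition: share of source tokens aligned to >= 1 target token."""
    if num_source_tokens == 0:
        return 0.0
    covered = {s for s, _ in alignment if 0 <= s < num_source_tokens}
    return len(covered) / num_source_tokens


def select_preference_pair(scores: Dict[str, float]) -> Tuple[str, str]:
    """Return (chosen, rejected): the systems with the highest and lowest coverage."""
    if len(scores) < 2:
        raise ValueError("need at least two candidate translations")
    chosen = max(scores, key=scores.get)
    rejected = min(scores, key=scores.get)
    if scores[chosen] == scores[rejected]:
        # All candidates tie, so there is no preference signal to extract.
        raise ValueError("all coverage scores are equal")
    return chosen, rejected


def split_hard_easy(comet_scores: List[float], n: int) -> Tuple[List[int], List[int]]:
    """Indices of the n lowest-scoring instances (hard) and of the rest (easy)."""
    order = sorted(range(len(comet_scores)), key=lambda i: comet_scores[i])
    return sorted(order[:n]), sorted(order[n:])
```

For example, `select_preference_pair({"chatgpt": 0.5, "deepl": 0.9, "human": 0.75})` would pick `"deepl"` as chosen and `"chatgpt"` as rejected. The scores in that call are made up for illustration only. How ties are broken and whether a minimum score gap is required are design choices the page does not specify; the sketch simply rejects the all-equal case.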