Mitigating Hallucinated Translations in Large Language Models with Hallucination-focused Preference Optimization

Zilu Tang, Rajen Chatterjee, Sarthak Garg

TL;DR

The paper tackles hallucinations in LLM-based machine translation by introducing an intrinsic mitigation pipeline that automatically creates hallucination-focused preference data from monolingual corpora. It uses offline post-hoc mitigation to generate high-quality preferred translations, then trains with Contrastive Preference Optimization (CPO) using a scaled loss that strongly penalizes dispreferred, hallucination-containing outputs. Results show a 96% average reduction in hallucinations across five language pairs and an 89% average reduction in a zero-shot setting on three unseen target languages, with translation quality largely preserved when training on a mixture of hallucination-focused and translation-quality preferences. The approach offers scalable, detector-agnostic mitigation and generalizes well across languages, though it is compute-intensive and evaluated primarily on en→X directions.
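To make the training objective concrete, the following is a minimal sketch of a per-example CPO-style loss, assuming the standard CPO form: a preference term on the log-probability margin between the preferred and dispreferred translation, plus a negative log-likelihood term on the preferred one. The summary does not say how the paper scales its loss, so `pref_scale` is a hypothetical placeholder multiplier on the preference term, and `beta` and the example log-probabilities are illustrative values only.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def cpo_loss(
    logp_preferred: float,
    logp_dispreferred: float,
    beta: float = 0.1,
    pref_scale: float = 1.0,  # hypothetical scaling knob, not taken from the paper
) -> float:
    """Per-example CPO-style loss from sequence log-probabilities.

    logp_preferred / logp_dispreferred are log pi(y | x) under the model
    being trained, for the preferred and dispreferred translations.
    """
    # Preference term: push the preferred translation above the dispreferred one.
    margin = beta * (logp_preferred - logp_dispreferred)
    prefer_term = -log_sigmoid(margin)
    # NLL term: keep the model close to the preferred translation itself.
    nll_term = -logp_preferred
    return pref_scale * prefer_term + nll_term


if __name__ == "__main__":
    # The model already favors the preferred translation: small loss.
    print(cpo_loss(logp_preferred=-5.0, logp_dispreferred=-50.0))
    # The model favors the hallucinated translation: much larger loss.
    print(cpo_loss(logp_preferred=-50.0, logp_dispreferred=-5.0))
```

Raising `pref_scale` in this sketch increases the penalty whenever the dispreferred output is scored close to or above the preferred one, which is one plausible reading of "scaled loss"; the paper itself is the authority on the exact formulation.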

Abstract

Machine Translation (MT) is undergoing a paradigm shift, with systems based on fine-tuned large language models (LLMs) becoming increasingly competitive with traditional encoder-decoder models trained specifically for translation tasks. However, LLM-based systems are at a higher risk of generating hallucinations, which can severely undermine user trust and safety. Most prior research on hallucination mitigation focuses on traditional MT models, with solutions that involve post-hoc mitigation: detecting hallucinated translations and re-translating them. While effective, this approach adds the complexity of deploying extra tools in production and also increases latency. To address these limitations, we propose a method that intrinsically learns to mitigate hallucinations during the model training phase. Specifically, we introduce a data creation framework to generate hallucination-focused preference datasets. Fine-tuning LLMs on these preference datasets reduces the hallucination rate by an average of 96% across five language pairs, while preserving overall translation quality. In a zero-shot setting, our approach reduces hallucinations by 89% on average across three unseen target languages.
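The data creation idea described above (detect a hallucinated translation, re-translate it, and keep both outputs as a preference pair) can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' pipeline: `translate`, `retranslate`, and `is_hallucination` are hypothetical callables standing in for the base model, the offline post-hoc mitigation step, and whichever hallucination detector is used.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class PreferencePair:
    source: str
    preferred: str      # mitigated, non-hallucinated translation
    dispreferred: str   # original hallucinated translation


def build_preference_pair(
    source: str,
    translate: Callable[[str], str],               # hypothetical: base model output
    retranslate: Callable[[str], Iterable[str]],   # hypothetical: mitigation candidates
    is_hallucination: Callable[[str, str], bool],  # hypothetical: detector(source, hypothesis)
) -> Optional[PreferencePair]:
    """Return a preference pair if the first translation hallucinates
    and some re-translation does not; otherwise return None."""
    first = translate(source)
    if not is_hallucination(source, first):
        return None  # nothing to mitigate, so no hallucination-focused pair
    for candidate in retranslate(source):
        if not is_hallucination(source, candidate):
            return PreferencePair(source, preferred=candidate, dispreferred=first)
    return None  # mitigation failed for this source sentence
```

Because the pair is built offline from monolingual source sentences, the detector and re-translation cost is paid once at data creation time rather than on every production request.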

Paper Structure

This paper contains 49 sections, 17 equations, 5 figures, 28 tables.

Figures (5)

  • Figure 1: Distribution of the hallucination score (HS) on $\mathcal{D}_m^{test}$.
  • Figure 2: Hallucination score (HS) distribution for ALMA-7B-R and $\mathcal{M}_{p+a}$ on $\mathcal{D}_m^{test}$. Right plots are zoomed-in on hallucination regions.
  • Figure 3: COMET score (Unbabel/wmt22-cometkiwi-da) distribution for ALMA-7B-R and $\mathcal{M}_{p+a}$ on $\mathcal{D}_m^{test}$.
  • Figure 4: Regression plots showing hallucination score (HS) for ALMA-7B-R and $\mathcal{M}_{p+a}$ on $\mathcal{D}_m^{test}$.
  • Figure 5: Regression plots showing COMET score (Unbabel/wmt22-cometkiwi-da) for ALMA-7B-R and $\mathcal{M}_{p+a}$ on $\mathcal{D}_m^{test}$.