Machine Translation Hallucination Detection for Low and High Resource Languages using Large Language Models

Kenza Benkirane, Laura Gongas, Shahar Pelles, Naomi Fuchs, Joshua Darmon, Pontus Stenetorp, David Ifeoluwa Adelani, Eduardo Sánchez

TL;DR

The key takeaway from this study is that LLMs can achieve performance comparable to, or even better than, previously proposed models, despite not being explicitly trained for any machine translation task. However, their advantage is less significant for LRLs.

Abstract

Recent advancements in massively multilingual machine translation systems have significantly enhanced translation accuracy; however, even the best-performing systems still generate hallucinations, severely impacting user trust. Detecting hallucinations in Machine Translation (MT) remains a critical challenge, particularly since existing methods excel with High-Resource Languages (HRLs) but exhibit substantial limitations when applied to Low-Resource Languages (LRLs). This paper evaluates sentence-level hallucination detection approaches using Large Language Models (LLMs) and semantic similarity within massively multilingual embeddings. Our study spans 16 language directions, covering HRLs and LRLs with diverse scripts. We find that the choice of model is essential for performance. On average, for HRLs, Llama3-70B outperforms the previous state of the art by as much as 0.16 MCC (Matthews Correlation Coefficient). However, for LRLs we observe that Claude Sonnet outperforms other LLMs on average by 0.03 MCC. The key takeaway from our study is that LLMs can achieve performance comparable to, or even better than, previously proposed models, despite not being explicitly trained for any machine translation task. However, their advantage is less significant for LRLs.
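
The abstract reports all results in MCC. As a reading aid, the following is a minimal sketch of how MCC can be computed for binary hallucination labels; it is not the authors' evaluation code. The function name `mcc`, the label convention (1 = hallucination, 0 = not a hallucination), and the toy labels are assumptions made for illustration only.

```python
import math


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (1 = hallucination).

    Hypothetical helper for illustration; returns 0.0 when the denominator
    is zero (for example, when every prediction falls in a single class).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy example (invented labels, not data from the paper).
gold = [1, 0, 0, 1, 0, 0]
pred = [1, 0, 1, 1, 0, 0]
score = mcc(gold, pred)
print(f"MCC = {score:.3f}")
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect agreement). Because it uses all four cells of the confusion matrix, it remains informative when hallucinations are rare relative to correct translations.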

Paper Structure (41 sections, 16 figures, 6 tables)


Figures (16)

  • Figure 1: Illustration of how a selection of the evaluated methods perform from Yoruba to Spanish and from Arabic to English.
  • Figure 2: Average MCC score across high and low resource levels, for different directions. The best-performing models differ significantly between HRLs and LRLs. For HRLs, Llama3-70B greatly outperforms other methods, whereas for LRLs, the best performers differ between translation from and into LRLs, with Claude and GPT models closely competing. Embeddings demonstrate impressive results, particularly for the EN$\rightarrow$HRL directions.
  • Figure 3: Binary detection prompt sample.
  • Figure 4: Selection type distribution. This graph shows that the three EN$\rightarrow$LRL directions not only have more sentences, but also have far more biased sentences than the other directions, which suggests a higher propensity to hallucinate.
  • Figure 5: Severity Ranking Prompt 1 - from G-Eval
  • ...and 11 more figures