How Much Do LLMs Hallucinate across Languages? On Realistic Multilingual Estimation of LLM Hallucination

Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš

TL;DR

The paper addresses the challenge of quantifying LLM hallucinations across languages in realistic, knowledge-intensive open-domain QA. It develops a multilingual hallucination detection (HD) model, translate-trained on the FAVA benchmark, and evaluates it on the mFAVA dataset, which provides silver (LLM-generated) annotations for 30 languages and gold (human) annotations for five high-resource languages. By combining the HD model's measured performance with a synthetic, knowledge-intensive evaluation dataset, the authors estimate per-language hallucination rates (HR_est) for 11 LLMs, finding that smaller models and those with broader language support tend to hallucinate more, while larger models are more robust, especially on longer outputs. The work introduces a practical, scalable protocol for cross-language hallucination estimation and reveals interactions between model size, language coverage, and text length, with implications for deploying multilingual LLMs in real-world settings.

Abstract

In the age of misinformation, hallucination - the tendency of Large Language Models (LLMs) to generate non-factual or unfaithful responses - represents the main risk for their global utility. Despite LLMs becoming increasingly multilingual, the vast majority of research on detecting and quantifying LLM hallucination is (a) English-centric and (b) focused on machine translation (MT) and summarization, tasks that are less common in realistic settings than open information seeking. In contrast, we aim to quantify the extent of LLM hallucination across languages in knowledge-intensive long-form question answering (LFQA). To this end, we train a multilingual hallucination detection model and conduct a large-scale study across 30 languages and 6 open-source LLM families. We start from an English hallucination detection dataset and rely on MT to translate-train a detection model. We also manually annotate gold data for five high-resource languages; we then demonstrate, for these languages, that the estimates of hallucination rates are similar between silver (LLM-generated) and gold test sets, validating the use of silver data for estimating hallucination rates for other languages. For the final rate estimation, we build an open-domain QA dataset for 30 languages with LLM-generated prompts and Wikipedia articles as references. Our analysis shows that LLMs, in absolute terms, hallucinate more tokens in high-resource languages due to longer responses, but that the actual hallucination rates (i.e., normalized for length) seem uncorrelated with the sizes of languages' digital footprints. We also find that smaller LLMs hallucinate more and, significantly, that LLMs with broader language support display higher hallucination rates.

Paper Structure

This paper contains 21 sections, 7 equations, 9 figures, 17 tables.

Figures (9)

  • Figure 1: Illustration of our approach for estimating hallucination rates. Hallucination Detection and Model Evaluation (left side): (1) We automatically translate the English FAVA dataset (Mishra et al., 2024) to 30 languages and train our multilingual hallucination detection (HD) model on this (noisy) multilingual training data; (2) We synthesize a silver multilingual hallucination evaluation dataset by prompting a state-of-the-art LLM (GPT-4) to introduce hallucinations in its answers to knowledge-seeking questions; for a subset of five high-resource languages, we additionally collect gold (i.e., human) hallucination annotations; we dub this 30-language evaluation benchmark mFAVA. We use mFAVA to estimate the HD model's per-language performance (precision and recall). Hallucination Rate Estimation (right side): (3) We estimate the hallucination rates for all 30 languages and six different LLM families from the number of detections of the HD model and its performance.
  • Figure 2: 1) Inter-annotator agreement (IAA) for hallucination span detection (Binary; blue bars) and classification (Category; orange bars) for five high-resource languages; 2) Hallucination span and class agreement between human labels and GPT-4 generated hallucinations (Silver-Gold; agreement on spans only: red bars; agreement on spans and hallucination type: green bars).
  • Figure 3: Comparison of hallucination rate estimates $\mathit{HR}_{\text{est}, l}$ (mean $\pm$ std over five LLM runs) for Arabic (AR), Chinese (ZH), German (DE), Russian (RU), and Turkish (TR) for 3 LLMs based on the estimates of $\mathit{P}_l$ and $\mathit{R}_l$ of the Multi (Bidirect) model on (1) mFAVA-Silver (top row) and (2) mFAVA-Gold (bottom row). The two sets of estimates are highly correlated $(r = 0.83,\ p = 1.26 \times 10^{-4})$.
  • Figure 4: Mean estimates of in-the-wild hallucination rates ($\pm$ std) for 30 languages and 11 LLMs. Each mean score is an average of 15 $\mathit{HR}_{\text{est}, l}$ estimates (3 different HD model instances applied to 5 different LLM responses). Average rates increase from top to bottom (over languages) and from left to right (over LLMs).
  • Figure 5: (a) Larger models hallucinate significantly less than smaller ones. Bars are labeled with $p$-values from a $t$-test. (b) Correlation between hallucination rates (averaged over all 30 languages) and the officially declared number of supported languages. (c) On average, as response length increases, so does the absolute number of detected hallucinations $\mathit{H}_{\text{detected}, l}$.
  • ...and 4 more figures
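
The Figure 1 caption says that hallucination rates are estimated "from the number of detections of the HD model and its performance". The sketch below shows one plausible way such a correction could work; it is an illustration under stated assumptions, not necessarily the exact estimator used in the paper. It assumes that detections are counted in tokens, that multiplying the detected count by precision $\mathit{P}_l$ removes false positives, that dividing by recall $\mathit{R}_l$ compensates for missed hallucinations, and that the result is normalized by the total number of generated tokens. The function name, its parameters, and all numbers are hypothetical.

```python
def estimate_hallucination_rate(detected_tokens: float,
                                total_tokens: float,
                                precision: float,
                                recall: float) -> float:
    """Precision/recall-corrected hallucination rate, in percent.

    Assumed correction (illustrative, not confirmed by the source):
        HR_est = 100 * (P * H_detected) / (R * N)
    where H_detected is the number of tokens the detector flagged and
    N is the total number of generated tokens for the language.
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if not 0.0 <= precision <= 1.0:
        raise ValueError("precision must lie in [0, 1]")
    if not 0.0 < recall <= 1.0:
        raise ValueError("recall must lie in (0, 1]")

    # Keep only the share of detections expected to be correct.
    expected_true_positives = precision * detected_tokens
    # Scale up to account for hallucinated tokens the detector missed.
    expected_hallucinated = expected_true_positives / recall
    # Normalize by response length so languages are comparable.
    return 100.0 * expected_hallucinated / total_tokens


# Made-up numbers: a language with longer responses has more detected
# tokens in absolute terms, yet the length-normalized rate is the same.
long_responses = estimate_hallucination_rate(200, 4000, precision=0.6, recall=0.5)
short_responses = estimate_hallucination_rate(100, 2000, precision=0.6, recall=0.5)
print(f"long: {long_responses:.2f}%  short: {short_responses:.2f}%")
```

The two calls at the end mirror the distinction the abstract draws between absolute hallucinated tokens, which grow with response length, and hallucination rates, which are normalized for length.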