Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation

Xinwei Wu, Heng Liu, Jiang Zhou, Xiaohu Zhao, Linlong Xu, Longyue Wang, Weihua Luo, Kaifu Zhang

TL;DR

The paper addresses translation hallucinations in multilingual LLMs, arguing that existing benchmarks fail to reveal current weaknesses. It introduces a fine-grained taxonomy distinguishing Instruction Detachment from Source Detachment and builds HalloMTBench, a multilingual, human-verified benchmark across 11 EN→X directions using a four-stage pipeline of generation, detection, expert annotation, and quality control. An evaluation of 17 LLMs uncovers distinct hallucination triggers tied to model scale, input length, linguistic biases, and RL-influenced reasoning, and shows that RL can amplify language mixing and cross-language confusion. The work provides a practical testbed for diagnosing translation failures and offers guidance for improving robustness; the benchmark is publicly available for continued research.

Abstract

Large Language Models (LLMs) have advanced machine translation but remain vulnerable to hallucinations. Unfortunately, existing MT benchmarks are not capable of exposing failures in multilingual LLMs. To expose hallucination in multilingual LLMs, we introduce a diagnostic framework with a taxonomy that separates Instruction Detachment from Source Detachment. Guided by this taxonomy, we create HalloMTBench, a multilingual, human-verified benchmark across 11 English-to-X directions. We employ 4 frontier LLMs to generate candidates and scrutinize these candidates with an ensemble of LLM judges and expert validation. In this way, we curate 5,435 high-quality instances. We evaluate 17 LLMs on HalloMTBench. Results reveal distinct "hallucination triggers": unique failure patterns reflecting model scale, source length sensitivity, linguistic biases, and Reinforcement-Learning (RL) amplified language mixing. HalloMTBench offers a forward-looking testbed for diagnosing LLM translation failures. HalloMTBench is available at https://huggingface.co/collections/AIDC-AI/marco-mt.
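As a rough illustration of how results over such a benchmark might be summarized, the sketch below computes a per-model hallucination rate and the normalized share of each top-level taxonomy category. Only the two category names come from the abstract; the record layout, the model names, and the helper functions `hallucination_rate` and `category_shares` are assumptions made for this example and do not reflect the authors' evaluation code or the dataset's actual schema.

```python
from collections import Counter

# Hypothetical judged outputs: (model, top-level label); None means no hallucination.
# The two label names follow the taxonomy; the records themselves are invented.
records = [
    ("model_a", "Instruction Detachment"),
    ("model_a", None),
    ("model_a", "Source Detachment"),
    ("model_a", None),
    ("model_b", "Source Detachment"),
    ("model_b", None),
]


def hallucination_rate(records, model):
    """Fraction of a model's outputs that carry any hallucination label."""
    labels = [label for m, label in records if m == model]
    if not labels:
        return 0.0
    return sum(label is not None for label in labels) / len(labels)


def category_shares(records, model):
    """Normalized share of each category among a model's hallucinated outputs."""
    counts = Counter(
        label for m, label in records if m == model and label is not None
    )
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


for name in ("model_a", "model_b"):
    print(name, hallucination_rate(records, name), category_shares(records, name))
```

Keeping the rate and the category shares as separate functions mirrors the distinction between an overall hallucination rate per model and a normalized distribution over hallucination types.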

Paper Structure

This paper contains 32 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Obsolescence of existing MT hallucination benchmarks. While leading LLMs achieve a 0% hallucination rate on established datasets, they exhibit significant hallucination on our proposed benchmark, HalloMTBench.
  • Figure 2: An example of an "Incorrect Language" hallucination data instance from HalloMTBench.
  • Figure 3: Language pair distribution in HalloMTBench dataset. The chart shows the proportion of each English-to-X ('en-xx') translation direction.
  • Figure 4: Distribution of hallucination types for selected models on our test set. Each stacked bar shows the normalized proportion of different hallucination categories, based on our proposed taxonomy.
  • Figure 5: Overall hallucination rates for each model across our entire benchmark.
  • ...and 4 more figures