Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation

Xinwei Wu; Heng Liu; Jiang Zhou; Xiaohu Zhao; Linlong Xu; Longyue Wang; Weihua Luo; Kaifu Zhang

TL;DR

The paper addresses translation hallucinations in multilingual LLMs, arguing that existing MT benchmarks fail to expose the weaknesses of current models. It introduces a fine-grained taxonomy that separates Instruction Detachment from Source Detachment and builds HalloMTBench, a multilingual, human-verified benchmark covering 11 EN→X directions, using a four-stage pipeline of generation, detection, expert annotation, and quality control. An evaluation of 17 LLMs uncovers distinct hallucination triggers tied to model scale, input length, linguistic biases, and RL-influenced reasoning, and shows that RL can amplify language mixing and cross-language confusion. The work provides a practical testbed for diagnosing translation failures and offers guidance for improving robustness, with open-source tools and data available for continued research and benchmarking.

Abstract

Large Language Models (LLMs) have advanced machine translation (MT) but remain vulnerable to hallucinations. Existing MT benchmarks, however, fail to expose these failures in multilingual LLMs. To surface hallucination in multilingual LLMs, we introduce a diagnostic framework with a taxonomy that separates Instruction Detachment from Source Detachment. Guided by this taxonomy, we create HalloMTBench, a multilingual, human-verified benchmark across 11 English-to-X directions. We use 4 frontier LLMs to generate candidates, then scrutinize these candidates with an ensemble of LLM judges and expert validation, yielding 5,435 high-quality instances. We evaluate 17 LLMs on HalloMTBench. The results reveal distinct "hallucination triggers": unique failure patterns reflecting model scale, source-length sensitivity, linguistic biases, and language mixing amplified by reinforcement learning (RL). HalloMTBench offers a forward-looking testbed for diagnosing LLM translation failures and is available at https://huggingface.co/collections/AIDC-AI/marco-mt.
