Halluverse-M^3: A multitask multilingual benchmark for hallucination in LLMs
Samir Abdaljalil, Parichit Sharma, Erchin Serpedin, Hasan Kurban
TL;DR
Halluverse-M^3 is a fine-grained, multilingual, multitask benchmark for studying hallucinations in LLM outputs across languages and tasks. The dataset covers four languages (English, Arabic, Hindi, Turkish), two generation tasks (question answering and dialogue summarization), and three hallucination types (entity, relation, sentence), built with a controlled injection protocol and human validation. The authors describe the full construction methodology, evaluate a range of open-source and proprietary models, and report task- and language-specific patterns: hallucinations in question answering are easier to detect than those in dialogue summarization, and detection performance is strongest in English. The benchmark enables systematic analysis of hallucinations and supports future work on detection and mitigation in multilingual, multitask generation systems.
Abstract
Hallucinations in large language models remain a persistent challenge, particularly in multilingual and generative settings where factual consistency is difficult to maintain. While recent models show strong performance on English-centric benchmarks, their behavior across languages, tasks, and hallucination types is not yet well understood. In this work, we introduce Halluverse-M^3, a dataset designed to enable systematic analysis of hallucinations across multiple languages, multiple generation tasks, and multiple hallucination categories. Halluverse-M^3 covers four languages (English, Arabic, Hindi, and Turkish) and supports two generation tasks: question answering and dialogue summarization. The dataset explicitly distinguishes between entity-level, relation-level, and sentence-level hallucinations. Hallucinated outputs are constructed through a controlled editing process and validated by human annotators, ensuring clear alignment between original content and hallucinated generations. Using this dataset, we evaluate a diverse set of contemporary open-source and proprietary language models on fine-grained hallucination detection. Our results show that question answering is consistently easier than dialogue summarization, while sentence-level hallucinations remain challenging even for the strongest models. Performance is highest in English and degrades in lower-resource languages, with Hindi exhibiting the lowest detection accuracy. Overall, Halluverse-M^3 provides a realistic and challenging benchmark for studying hallucinations in multilingual, multi-task settings. We release the dataset to support future research on hallucination detection and mitigation: https://huggingface.co/datasets/sabdalja/HalluVerse-M3
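To make the evaluation setup concrete, the following is a minimal sketch of how fine-grained detection accuracy could be broken down by language and task. It is not the authors' evaluation code: the record fields (`language`, `task`, `label`), the language and task codes, and the helper `accuracy_by_group` are all hypothetical, and the toy records are invented for illustration rather than taken from the dataset. Only the three label names come from the abstract.

```python
from collections import defaultdict

# Label set taken from the abstract; the record layout below is hypothetical.
HALLUCINATION_TYPES = ("entity", "relation", "sentence")


def accuracy_by_group(records, predictions, keys=("language", "task")):
    """Return {group: accuracy}, grouping records by the given fields."""
    if len(records) != len(predictions):
        raise ValueError("records and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for record, predicted in zip(records, predictions):
        if predicted not in HALLUCINATION_TYPES:
            raise ValueError(f"unknown label: {predicted!r}")
        group = tuple(record[k] for k in keys)
        total[group] += 1
        correct[group] += int(predicted == record["label"])
    return {group: correct[group] / total[group] for group in total}


# Toy records for illustration only; not drawn from the dataset.
records = [
    {"language": "en", "task": "qa", "label": "entity"},
    {"language": "en", "task": "qa", "label": "relation"},
    {"language": "hi", "task": "summarization", "label": "sentence"},
    {"language": "hi", "task": "summarization", "label": "entity"},
]
predictions = ["entity", "relation", "entity", "entity"]

scores = accuracy_by_group(records, predictions)
```

Passing `keys=("language",)` or `keys=("task",)` would give the per-language or per-task view instead, which is the kind of comparison the abstract summarizes.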
