HalluVerse25: Fine-grained Multilingual Benchmark Dataset for LLM Hallucinations
Samir Abdaljalil, Hasan Kurban, Erchin Serpedin
TL;DR
HalluVerse25 addresses the need for fine-grained, multilingual benchmarks of LLM hallucinations by constructing a dataset covering English, Arabic, and Turkish. The authors combine Wikidata/SPARQL-based extraction of biographical facts with automated GPT-4-driven hallucination injection, followed by rigorous human annotation to label entity-, relation-, and sentence-level errors. They provide detailed dataset statistics, representational analyses, and baseline evaluations of multiple open and closed LLMs to reveal cross-language detection capabilities and the relative difficulty of sentence-level hallucinations. The benchmark aims to drive research on reliable multilingual LLM outputs and to support future extensions to more tasks and languages through a transparent, reproducible workflow.
Abstract
Large Language Models (LLMs) are increasingly used in various contexts, yet remain prone to generating non-factual content, commonly referred to as "hallucinations". The literature categorizes hallucinations into several types, including entity-level, relation-level, and sentence-level hallucinations. However, existing hallucination datasets often fail to capture fine-grained hallucinations in multilingual settings. In this work, we introduce HalluVerse25, a multilingual LLM hallucination dataset that categorizes fine-grained hallucinations in English, Arabic, and Turkish. Our dataset construction pipeline uses an LLM to inject hallucinations into factual biographical sentences, followed by a rigorous human annotation process to ensure data quality. We evaluate several LLMs on HalluVerse25, providing valuable insights into how proprietary models perform in detecting LLM-generated hallucinations across different contexts.
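To make the evaluation setup concrete, the following is a minimal sketch of how detection accuracy could be broken down by language and by hallucination type. It is not the authors' evaluation code: the `Example` record, its field names, the language codes, and the `accuracy_by` helper are all assumptions introduced for illustration, and the toy records are invented placeholders rather than entries or results from HalluVerse25.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three hallucination types named in the abstract.
LABELS = ("entity", "relation", "sentence")


@dataclass(frozen=True)
class Example:
    """Hypothetical record layout; the real dataset schema may differ."""
    language: str   # assumed codes: "en", "ar", "tr"
    gold: str       # human-annotated hallucination type
    predicted: str  # type returned by the model under evaluation


def accuracy_by(examples, key):
    """Return {group: accuracy}, grouping examples by the attribute `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        group = getattr(ex, key)
        total[group] += 1
        correct[group] += int(ex.gold == ex.predicted)
    return {group: correct[group] / total[group] for group in total}


# Invented placeholder records, only to exercise the helper.
toy = [
    Example("en", "entity", "entity"),
    Example("en", "sentence", "relation"),
    Example("ar", "relation", "relation"),
    Example("tr", "sentence", "sentence"),
]

per_language = accuracy_by(toy, "language")
per_type = accuracy_by(toy, "gold")
```

Grouping by `gold` rather than `predicted` keeps each type's accuracy tied to the human annotation, which is what a comparison of entity-, relation-, and sentence-level difficulty would need.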
