HalluVerse25: Fine-grained Multilingual Benchmark Dataset for LLM Hallucinations

Samir Abdaljalil, Hasan Kurban, Erchin Serpedin

TL;DR

HalluVerse25 addresses the pressing need for fine-grained, multilingual benchmarks of LLM hallucinations by constructing a dataset across English, Arabic, and Turkish. The authors combine a Wikidata/SPARQL-based extraction of biographical facts with an automated GPT-4-driven hallucination injection, followed by rigorous human annotation to label entity-, relation-, and sentence-level errors. They provide detailed dataset statistics, representational analyses, and baseline evaluations of multiple LLMs (open and closed) to reveal cross-language detection capabilities and the relative difficulty of sentence-level hallucinations. The benchmark aims to drive research in reliable, multilingual LLM outputs and to support future extensions to more tasks and languages with a transparent, reproducible workflow.

Abstract

Large Language Models (LLMs) are increasingly used in various contexts, yet remain prone to generating non-factual content, commonly referred to as "hallucinations". The literature categorizes hallucinations into several types, including entity-level, relation-level, and sentence-level hallucinations. However, existing hallucination datasets often fail to capture fine-grained hallucinations in multilingual settings. In this work, we introduce HalluVerse25, a multilingual LLM hallucination dataset that categorizes fine-grained hallucinations in English, Arabic, and Turkish. Our dataset construction pipeline uses an LLM to inject hallucinations into factual biographical sentences, followed by a rigorous human annotation process to ensure data quality. We evaluate several LLMs on HalluVerse25, providing valuable insights into how proprietary models perform in detecting LLM-generated hallucinations across different contexts.

Paper Structure

This paper contains 23 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Pipeline of Dataset Construction, including extraction of biographical factual sentences, automatic injection of fine-grained hallucinations, and human annotation.
  • Figure 2: Overview of entity diversity in the dataset. (a) Distribution of professions, (b) Geographic diversity of entities, (c) Birth year distribution.
  • Figure 3: Final dataset hallucination type distribution for (a) English, (b) Arabic, and (c) Turkish data.
  • Figure 4: Hallucination Category Confusion Matrices of gpt-4o labels for (a) English, (b) Arabic, and (c) Turkish data.
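
Figure 4 reports confusion matrices over the three hallucination categories (entity, relation, sentence). As a rough illustration of how such a matrix could be tabulated from gold and model-assigned labels, here is a minimal Python sketch. The function names and the toy labels are assumptions made for this example; they are not taken from the paper or its released data.

```python
from collections import Counter

# The three fine-grained categories named in the abstract.
LABELS = ("entity", "relation", "sentence")


def confusion_matrix(gold, predicted, labels=LABELS):
    """Return rows of counts: matrix[i][j] = items with gold label i predicted as label j.

    Pairs containing a label outside `labels` are not counted in any cell.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = Counter(zip(gold, predicted))
    return [[pairs[(g, p)] for p in labels] for g in labels]


def per_label_recall(matrix):
    """Recall per gold label: diagonal count divided by the row total (0.0 for empty rows)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Made-up toy labels, for illustration only (not drawn from the dataset).
gold = ["entity", "entity", "relation", "sentence", "sentence"]
pred = ["entity", "relation", "relation", "sentence", "entity"]

matrix = confusion_matrix(gold, pred)
recall = per_label_recall(matrix)
```

Each row of `matrix` corresponds to a gold category and each column to a predicted category, in the order given by `LABELS`, so off-diagonal cells show which categories are confused with one another. Running the same tabulation separately per language would give one matrix each for English, Arabic, and Turkish.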