MultiHal: Multilingual Dataset for Knowledge-Graph Grounded Evaluation of LLM Hallucinations
Ernests Lavrinovics, Russa Biswas, Katja Hose, Johannes Bjerva
TL;DR
MultiHal is a multilingual, knowledge-graph-grounded benchmark for evaluating and mitigating LLM hallucinations. The authors mine 140k candidate Wikidata KG paths for 31k questions and use an LLM-based judge to filter them down to 25.9k high-quality paths, enabling KG-RAG evaluation across five languages with open- and closed-source models. Injecting KG paths as in-context knowledge consistently improves semantic similarity, NLI entailment, and hallucination detection scores, demonstrating the value of structured factual grounding for multilingual factuality. The work provides a scalable framework and open resources to advance graph-based fact-checking, knowledge injection, and robust LLM deployment in multilingual settings.
Abstract
Large Language Models (LLMs) have inherent limitations of faithfulness and factuality, commonly referred to as hallucinations. Several benchmarks provide a test bed for factuality evaluation, but they are English-centric and rely on supplementary informative context such as web links or text passages while ignoring the available structured factual resources. Knowledge Graphs (KGs) have been identified as a useful aid for hallucination mitigation, as they provide a structured way to represent facts about entities and their relations with minimal linguistic overhead. We address the lack of KG paths and multilinguality for factual language modeling in existing hallucination evaluation benchmarks and propose MultiHal, a KG-based, multilingual, multihop benchmark framed for generative text evaluation. As part of our data collection pipeline, we mined 140k KG paths from open-domain KGs and pruned the noisy ones, curating a high-quality subset of 25.9k paths. Our baseline evaluation shows that KG-RAG improves over vanilla QA, on an absolute scale, by approximately 0.12 to 0.36 points in semantic similarity score, 0.16 to 0.36 in NLI entailment, and 0.29 to 0.42 in hallucination detection across multiple languages and multiple models, demonstrating the potential of KG integration. We anticipate that MultiHal will foster future research on several graph-based hallucination mitigation and fact-checking tasks.
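As a rough illustration of the difference between vanilla QA and KG-RAG described above, the sketch below shows one way a multihop KG path might be linearized and injected into a prompt as in-context knowledge. The function names, the prompt wording, the arrow-based path format, and the example path are all assumptions made for illustration; they are not taken from the MultiHal pipeline.

```python
from typing import List, Tuple

# A KG path is modeled here as a list of (subject, predicate, object) hops.
# This representation is an assumption for illustration only.
Triple = Tuple[str, str, str]


def linearize_path(path: List[Triple]) -> str:
    """Render each hop as 'subject -> predicate -> object', joined by '; '."""
    return "; ".join(f"{s} -> {p} -> {o}" for s, p, o in path)


def build_vanilla_prompt(question: str) -> str:
    """Baseline prompt: the question alone, with no external knowledge."""
    return f"Question: {question}\nAnswer:"


def build_kg_rag_prompt(question: str, path: List[Triple]) -> str:
    """KG-RAG prompt: the linearized path is prepended as in-context facts."""
    return (
        "Use the following knowledge-graph facts to answer the question.\n"
        f"Facts: {linearize_path(path)}\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical two-hop example, not drawn from the dataset.
EXAMPLE_QUESTION = "In which country was Marie Curie born?"
EXAMPLE_PATH: List[Triple] = [
    ("Marie Curie", "place of birth", "Warsaw"),
    ("Warsaw", "country", "Poland"),
]

if __name__ == "__main__":
    print(build_vanilla_prompt(EXAMPLE_QUESTION))
    print()
    print(build_kg_rag_prompt(EXAMPLE_QUESTION, EXAMPLE_PATH))
```

Both prompts would be sent to the same model, and the two answers scored against the reference answer with the metrics named in the abstract (semantic similarity, NLI entailment, hallucination detection).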
