SelfCheck-Eval: A Multi-Module Framework for Zero-Resource Hallucination Detection in Large Language Models

Diyana Muhammed, Giusy Giulia Tuccari, Gollam Rabby, Sören Auer, Sahar Vahdati

TL;DR

SelfCheck-Eval introduces a robust, black-box, zero-resource framework for hallucination detection across LLM outputs, paired with the AIME Math Hallucination Benchmark to probe mathematical reasoning. The framework comprises three complementary modules—Semantic, Specialized Detection, and Contextual Consistency—each contributing distinct signals about factuality and consistency. Across WikiBio and AIME-Math, the study reveals strong detection of incorrect outputs but persistent difficulties validating correct mathematical reasoning, highlighting fundamental domain-specific gaps in current approaches. The results motivate hybrid approaches that combine statistical, semantic, and formal-reasoning verification to enable reliable deployment of LLMs in mathematics-heavy and other high-stakes domains.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse applications, from open-domain question answering to scientific writing, medical decision support, and legal analysis. However, their tendency to generate incorrect or fabricated content, commonly known as hallucinations, represents a critical barrier to reliable deployment in high-stakes domains. Current hallucination detection benchmarks are limited in scope, focusing primarily on general-knowledge domains while neglecting specialised fields where accuracy is paramount. To address this gap, we introduce the AIME Math Hallucination dataset, the first comprehensive benchmark specifically designed for evaluating mathematical reasoning hallucinations. Additionally, we propose SelfCheck-Eval, an LLM-agnostic, black-box hallucination detection framework applicable to both open- and closed-source LLMs. Our approach leverages a novel multi-module architecture that integrates three independent detection strategies: the Semantic module, the Specialised Detection module, and the Contextual Consistency module. Our evaluation reveals systematic performance disparities across domains: existing methods perform well on biographical content but struggle significantly with mathematical reasoning, a challenge that persists across NLI fine-tuning, preference learning, and process supervision approaches. These findings highlight the fundamental limitations of current detection methods in mathematical domains and underscore the critical need for specialised, black-box compatible approaches to ensure reliable LLM deployment.

Paper Structure

This paper contains 51 sections, 9 equations, 3 figures, 11 tables.

Figures (3)

  • Figure 1: Overview of the hallucination detection experiment conducted within the SelfCheck-Eval framework.
  • Figure 2: Overview of the SelfCheck-Eval pipeline with the approaches for hallucination detection: Semantic module, Specialized Detection module (pre-trained and fine-tuned), and Contextual Consistency module.
  • Figure 3: Analysis of LLM capacity impact.