FactSelfCheck: Fact-Level Black-Box Hallucination Detection for LLMs

Albert Sawczyn, Jakub Binkowski, Denis Janiak, Bogdan Gabrys, Tomasz Kajdanowicz

TL;DR

This work introduces FactSelfCheck, a zero-resource, black-box method that detects hallucinations in LLM outputs at the level of individual facts. It extracts knowledge-graph facts (triples) from a response and checks their consistency across multiple sampled responses. Two scoring variants are presented: FactSelfCheck-KG, which compares each fact with the knowledge graphs extracted from the samples, and FactSelfCheck-Text, which compares each fact with the sample text directly. Experiments show that fact-level detection can closely match or exceed sentence-level methods while substantially improving hallucination correction: a 35.5% increase in factual content over the baseline, versus 10.6% for sentence-level SelfCheckGPT. The authors also release FavaMultiSamples, a new benchmark for evaluating sampling-based hallucination detection methods. Because it requires no external knowledge sources, the approach provides interpretable, fine-grained factuality scores and applies across domains and model families. Its limitations are higher computational cost and the constraints of the available datasets; future fact-level evaluation datasets and efficiency improvements are named as next steps.

Abstract

Large Language Models (LLMs) frequently generate hallucinated content, posing significant challenges for applications where factuality is crucial. While existing hallucination detection methods typically operate at the sentence level or passage level, we propose FactSelfCheck, a novel zero-resource black-box sampling-based method that enables fine-grained fact-level detection. Our approach represents text as interpretable knowledge graphs consisting of facts in the form of triples, providing clearer insights into content factuality than traditional approaches. Through analyzing factual consistency across multiple LLM responses, we compute fine-grained hallucination scores without requiring external resources or training data. Our evaluation demonstrates that FactSelfCheck performs competitively with leading sentence-level sampling-based methods while providing more detailed and interpretable insights. Most notably, our fact-level approach significantly improves hallucination correction, achieving a 35.5% increase in factual content compared to the baseline, while sentence-level SelfCheckGPT yields only a 10.6% improvement. The granular nature of our detection enables more precise identification and correction of hallucinated content. Additionally, we contribute FavaMultiSamples, a novel dataset that addresses a gap in the field by providing the research community with a second dataset for evaluating sampling-based methods.
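
To illustrate the consistency check described in the abstract, the following is a minimal sketch of a fact-level score computed over sampled responses. It assumes each response has already been converted into a set of triples and that a fact counts as supported only when the identical triple appears in a sample's knowledge graph. Exact matching, the function name `fact_hallucination_score`, and the example triples are all hypothetical choices made for illustration; the method itself may judge support differently (for example, with an LLM).

```python
from typing import FrozenSet, List, Tuple

# A fact is a triple: (head entity, relation, tail entity).
Triple = Tuple[str, str, str]


def fact_hallucination_score(fact: Triple, sample_kgs: List[FrozenSet[Triple]]) -> float:
    """Return the fraction of sampled knowledge graphs that do NOT contain the fact.

    0.0 means every sample supports the fact; 1.0 means no sample does.
    Exact triple matching is an illustrative assumption, not the paper's procedure.
    """
    if not sample_kgs:
        raise ValueError("at least one sampled knowledge graph is required")
    supported = sum(1 for kg in sample_kgs if fact in kg)
    return 1.0 - supported / len(sample_kgs)


# Hypothetical facts extracted from the response under evaluation.
response_facts: List[Triple] = [
    ("Marie Curie", "born in", "Warsaw"),
    ("Marie Curie", "won", "Fields Medal"),
]

# Hypothetical knowledge graphs extracted from four additional samples.
sample_kgs: List[FrozenSet[Triple]] = [
    frozenset({("Marie Curie", "born in", "Warsaw"), ("Marie Curie", "won", "Nobel Prize")}),
    frozenset({("Marie Curie", "born in", "Warsaw")}),
    frozenset({("Marie Curie", "born in", "Warsaw"), ("Marie Curie", "won", "Nobel Prize")}),
    frozenset({("Marie Curie", "won", "Nobel Prize")}),
]

scores = {fact: fact_hallucination_score(fact, sample_kgs) for fact in response_facts}
for fact, score in scores.items():
    print(fact, score)
```

A fact that most samples agree on receives a low score, while a fact that no sample reproduces receives the maximum score, which is the intuition behind sampling-based detection without external resources.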

Paper Structure

This paper contains 37 sections, 11 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: The pipeline of FactSelfCheck in two variants. For response $p$, entities $\mathcal{E}_p$ and relations $\mathcal{R}_p$ are extracted, followed by the construction of knowledge graphs $KG_p$, for which hallucination scores $\mathcal{H}_{\text{fact}}$ are calculated. Samples' entities $\mathcal{E}_S$ and relations $\mathcal{R}_S$ are created by merging $\mathcal{E}_p$ and $\mathcal{R}_p$ with entities and relations from $KG_p$. For each sample $s$, the knowledge graph $KG_s$ is extracted. FactSelfCheck-KG assesses the consistency between a fact and all $KG_s$. FactSelfCheck-Text assesses the consistency between a fact and all $s$ directly. To obtain sentence-level score $\mathcal{H}_{\text{sentence}}$, fact-level scores are aggregated, as indicated by dashed arrows.
  • Figure 2: WikiBio: Precision-recall curve for the sentence-level hallucination and factuality detection.
  • Figure 3: FavaMultiSamples: Precision-recall curve for the sentence-level hallucination and factuality detection.
  • Figure 4: WikiBio: Impact of sample size on both hallucination and factuality detection performance for different methods.
  • Figure 5: FavaMultiSamples: Impact of sample size on both hallucination and factuality detection performance for different methods.
  • ...and 1 more figure
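
The aggregation step in Figure 1, where fact-level scores $\mathcal{H}_{\text{fact}}$ are combined into a sentence-level score $\mathcal{H}_{\text{sentence}}$, can be sketched as below. The caption does not state which aggregation function is used, so `max` and `mean` are shown only as plausible assumptions, and the name `sentence_hallucination_score` is hypothetical.

```python
from statistics import mean
from typing import Callable, Iterable, List


def sentence_hallucination_score(
    fact_scores: Iterable[float],
    aggregate: Callable[[List[float]], float] = max,
) -> float:
    """Combine the fact-level scores of one sentence into a single score.

    The aggregation function is an assumption: `max` flags a sentence as soon
    as one of its facts looks hallucinated, `mean` averages over all facts.
    """
    scores = list(fact_scores)
    if not scores:
        raise ValueError("a sentence needs at least one fact-level score")
    return aggregate(scores)


# Hypothetical fact-level scores for the facts found in one sentence.
example_fact_scores = [0.0, 0.5, 1.0]

worst_case = sentence_hallucination_score(example_fact_scores)       # uses max
average = sentence_hallucination_score(example_fact_scores, mean)    # uses mean
print(worst_case, average)
```

Taking the maximum is the stricter choice, since a single unsupported fact marks the whole sentence; averaging is more lenient for sentences that mix supported and unsupported facts.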