Table of Contents
Fetching ...

Detecting LLM Fact-conflicting Hallucinations Enhanced by Temporal-logic-based Reasoning

Ningke Li, Yahui Song, Kailong Wang, Yuekang Li, Ling Shi, Yi Liu, Haoyu Wang

TL;DR

This work tackles fact-conflicting hallucinations ($FCH$) in large language models by introducing Drowzee, a temporal-logic-based metamorphic testing framework. It automatically builds a rich factual knowledge base from sources like Wikipedia, derives additional facts through logical rules, and generates temporally informed QA test cases using Metric Temporal Logic ($MTL$) encoded in Prolog. Two semantic-aware oracles compare LLM outputs against ground truth to detect $FCH$, enabling robust evaluation across nine LLMs and nine domains. Empirically, Drowzee automatically produced over 7,000 non-temporal and 1,800 temporal test cases, revealing substantial $FCH$ rates and revealing that lack of logical reasoning and temporal reasoning are major contributors. The framework, benchmarks, and analysis offer a scalable path to improve LLM reliability and transparency for high-stakes applications.

Abstract

Large language models (LLMs) face the challenge of hallucinations -- outputs that seem coherent but are actually incorrect. A particularly damaging type is fact-conflicting hallucination (FCH), where generated content contradicts established facts. Addressing FCH presents three main challenges: 1) Automatically constructing and maintaining large-scale benchmark datasets is difficult and resource-intensive; 2) Generating complex and efficient test cases that the LLM has not been trained on -- especially those involving intricate temporal features -- is challenging, yet crucial for eliciting hallucinations; and 3) Validating the reasoning behind LLM outputs is inherently difficult, particularly with complex logical relationships, as it requires transparency in the model's decision-making process. This paper presents Drowzee, an innovative end-to-end metamorphic testing framework that utilizes temporal logic to identify fact-conflicting hallucinations (FCH) in large language models (LLMs). Drowzee builds a comprehensive factual knowledge base by crawling sources like Wikipedia and uses automated temporal-logic reasoning to convert this knowledge into a large, extensible set of test cases with ground truth answers. LLMs are tested using these cases through template-based prompts, which require them to generate both answers and reasoning steps. To validate the reasoning, we propose two semantic-aware oracles that compare the semantic structure of LLM outputs to the ground truths. Across nine LLMs in nine different knowledge domains, experimental results show that Drowzee effectively identifies rates of non-temporal-related hallucinations ranging from 24.7% to 59.8%, and rates of temporal-related hallucinations ranging from 16.7% to 39.2%.

Detecting LLM Fact-conflicting Hallucinations Enhanced by Temporal-logic-based Reasoning

TL;DR

This work tackles fact-conflicting hallucinations () in large language models by introducing Drowzee, a temporal-logic-based metamorphic testing framework. It automatically builds a rich factual knowledge base from sources like Wikipedia, derives additional facts through logical rules, and generates temporally informed QA test cases using Metric Temporal Logic () encoded in Prolog. Two semantic-aware oracles compare LLM outputs against ground truth to detect , enabling robust evaluation across nine LLMs and nine domains. Empirically, Drowzee automatically produced over 7,000 non-temporal and 1,800 temporal test cases, revealing substantial rates and revealing that lack of logical reasoning and temporal reasoning are major contributors. The framework, benchmarks, and analysis offer a scalable path to improve LLM reliability and transparency for high-stakes applications.

Abstract

Large language models (LLMs) face the challenge of hallucinations -- outputs that seem coherent but are actually incorrect. A particularly damaging type is fact-conflicting hallucination (FCH), where generated content contradicts established facts. Addressing FCH presents three main challenges: 1) Automatically constructing and maintaining large-scale benchmark datasets is difficult and resource-intensive; 2) Generating complex and efficient test cases that the LLM has not been trained on -- especially those involving intricate temporal features -- is challenging, yet crucial for eliciting hallucinations; and 3) Validating the reasoning behind LLM outputs is inherently difficult, particularly with complex logical relationships, as it requires transparency in the model's decision-making process. This paper presents Drowzee, an innovative end-to-end metamorphic testing framework that utilizes temporal logic to identify fact-conflicting hallucinations (FCH) in large language models (LLMs). Drowzee builds a comprehensive factual knowledge base by crawling sources like Wikipedia and uses automated temporal-logic reasoning to convert this knowledge into a large, extensible set of test cases with ground truth answers. LLMs are tested using these cases through template-based prompts, which require them to generate both answers and reasoning steps. To validate the reasoning, we propose two semantic-aware oracles that compare the semantic structure of LLM outputs to the ground truths. Across nine LLMs in nine different knowledge domains, experimental results show that Drowzee effectively identifies rates of non-temporal-related hallucinations ranging from 24.7% to 59.8%, and rates of temporal-related hallucinations ranging from 16.7% to 39.2%.

Paper Structure

This paper contains 25 sections, 1 theorem, 7 equations, 20 figures, 4 tables, 3 algorithms.

Key Result

Theorem 3.1

Given any $\mathcal{H}$, $\phi$, and $\phi{\,\mathrel{\leadsto}\,}(\mathit{nm}, \widetilde{Q})$, let $\mathcal{P}{=}\mathcal{H} \texttt{+}\texttt{+} \widetilde{Q}$, we define, (1) Soundness: $\forall\, I$. $\llbracket \mathit{nm}(I) \rrbracket_{ \mathcal{P}} {=} \mathit{true}$, then $\forall\, t\,{\

Figures (20)

  • Figure 1: A Hallucination Output Example
  • Figure 2: Motivating Examples for Automatic Benchmark Construction with Complex Questions
  • Figure 3: Prolog Encoding for $\phi \,{=}\, \mathcal{F}_{[1, 3]}(\mathit{Ben\_10})$
  • Figure 4: Drowzee Overview
  • Figure 5: A Core Syntax of Prolog
  • ...and 15 more figures

Theorems & Definitions (3)

  • Definition 1: A Point-based Semantics for MTL
  • Theorem 3.1: Correctness of the encoding rules
  • proof