Logical Consistency of Large Language Models in Fact-checking

Bishwamittra Ghosh, Sarah Hasan, Naheed Anjum Arafat, Arijit Khan

TL;DR

This work tackles the problem of logical inconsistency in LLMs when performing fact-checking on propositional logic over knowledge graphs. It introduces a retrieval-augmented framework with KG contexts, formalizes binary consistency measures for simple and complex facts (via DNFs) and logic rules, and builds three KG-based benchmarks (FreebaseLFC, NELLLFC, WikiLFC). The study reveals that state-of-the-art LLMs struggle with complex logical queries, and shows that zero-shot prompting is insufficient; supervised fine-tuning with QLoRA yields robust gains in both accuracy (~19%) and consistency (~14%) across diverse facts and rules, and scales efficiently to large KGs. Context-aware retrieval and PEFT enable practical, scalable improvements, with extensions to De Morgan’s laws, first-order logic, and natural-text data like FEVER. Overall, the paper provides a concrete pathway to more trustworthy LLM-based fact-checking in KG-enabled settings and outlines directions for extending these methods to broader logical forms and incomplete knowledge sources.

Abstract

In recent years, large language models (LLMs) have demonstrated significant success on a variety of natural language tasks such as translation, question answering, summarization, and fact-checking. Despite their impressive ability to generate human-like text, LLMs are infamous for inconsistent responses: a meaning-preserving change in the input query can yield a different answer, which contributes to vulnerabilities of LLMs such as hallucination. Existing research, however, focuses on simple paraphrasing-based consistency assessment of LLMs and ignores complex queries that demand a deeper understanding of logical reasoning. Our work therefore addresses the logical inconsistency of LLMs under complex logical queries with primitive logical operators, e.g., negation, conjunction, and disjunction. As a test bed, we consider retrieval-augmented LLMs on a fact-checking task involving propositional logic queries from knowledge graphs (KGs). Our contributions are threefold. Benchmark: We introduce three logical fact-checking datasets over KGs for community development towards logically consistent LLMs. Assessment: We propose consistency measures of LLMs on propositional logic queries and demonstrate that existing LLMs lack logical consistency, especially on complex queries. Improvement: We employ supervised fine-tuning to improve the logical consistency of LLMs on the complex fact-checking task with KG contexts. We have made our source code and benchmarks available.

Paper Structure

This paper contains 52 sections, 3 theorems, 13 equations, 11 figures, 24 tables, 2 algorithms.

Key Result

Proposition 1

An LLM is consistent on a DNF fact $q=\bigvee_{i=1}^{n} c_i$, where $c_i = \bigwedge_{j=1}^{i_m} e_{ij}$, if $\mathtt{LLM}(q) = \bigvee_{i=1}^{n} \left(\bigwedge_{j=1}^{i_m} \mathtt{LLM}(e_{ij})\right)$. Here, $e_{ij}$ is an atomic relation fact for any $1\leq i\leq n$ and $1 \leq j \leq i_m$.
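The condition in Proposition 1 can be checked mechanically once the model's answers are available as booleans. The following is a minimal sketch, not the authors' implementation: the names `combine_atomic_answers` and `is_consistent_on_dnf`, the nested-list encoding of a DNF, and the stubbed answers are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

Fact = str
DNF = List[List[Fact]]  # outer list = disjunction, inner lists = conjunctions


def combine_atomic_answers(dnf: DNF, llm_atom: Callable[[Fact], bool]) -> bool:
    """Right-hand side of Proposition 1: OR over clauses of AND over atomic answers."""
    return any(all(llm_atom(e) for e in clause) for clause in dnf)


def is_consistent_on_dnf(
    dnf: DNF,
    llm_complex: Callable[[DNF], bool],
    llm_atom: Callable[[Fact], bool],
) -> bool:
    """True when the answer on the whole DNF matches the combined atomic answers."""
    return llm_complex(dnf) == combine_atomic_answers(dnf, llm_atom)


# Stubbed answers standing in for real model calls (hypothetical values).
atom_answers: Dict[Fact, bool] = {"e11": True, "e12": False, "e21": True}
dnf: DNF = [["e11", "e12"], ["e21"]]  # (e11 AND e12) OR e21

ok_case = is_consistent_on_dnf(dnf, lambda q: True, atom_answers.__getitem__)
bad_case = is_consistent_on_dnf(dnf, lambda q: False, atom_answers.__getitem__)
print(ok_case, bad_case)  # True False
```

Note that this check compares the model's answers only against each other, so a model can satisfy it while still being wrong about the underlying facts; correctness is a separate notion (Definition 1).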

Figures (11)

  • Figure 1: Our LLM-based fact-checking framework on a simple fact with context from a knowledge graph. A representative $\mathtt{LLMQuery}$ is in Figure 3, and the extension to complex facts is in Figure 4.
  • Figure 2: The adaptation of $\mathtt{LLMQuery}$ from Figure 1 for instruction prompting. The prompt contains a clear instruction (in blue) to guide the LLM towards outputting a correct response.
  • Figure 3: (Continuing Figure 1) A representative $\mathtt{LLMQuery}$ on a simple fact used in our experiments. The expected LLM response is shown in blue.
  • Figure 4: (Continuing Figure 1) $\mathtt{LLMQuery}$ when considering conjunctive and disjunctive facts. We replace logical operators with their natural language description, such as $\wedge$ with 'and' and $\vee$ with 'or', when constructing $\mathtt{LLMQuery}$.
  • Figure 5: Example of a $2$-shot prompt, where representative positive and negative examples are provided to guide the LLM to answer the target fact.
  • ...and 6 more figures
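
The operator substitution described in the Figure 4 caption (replacing $\wedge$ with 'and' and $\vee$ with 'or') could be sketched as follows. This is an illustrative assumption only: the function name `verbalize`, the symbol table, and the example facts are hypothetical, and the paper's actual prompt template is not reproduced here.

```python
# Hypothetical mapping from logical symbols to their natural-language form.
OPERATOR_WORDS = {"∧": "and", "∨": "or"}


def verbalize(fact: str) -> str:
    """Replace logical operator symbols with words and normalize whitespace."""
    for symbol, word in OPERATOR_WORDS.items():
        fact = fact.replace(symbol, f" {word} ")
    return " ".join(fact.split())


# Illustrative facts, not taken from the benchmarks.
query = verbalize("Paris is in France ∧ Berlin is in Germany")
print(query)  # Paris is in France and Berlin is in Germany
```

Padding each replacement with spaces and then collapsing whitespace keeps the output readable whether or not the input has spaces around the operators.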

Theorems & Definitions (12)

  • Example 1
  • Definition 1: Correctness of fact-checking via LLM response
  • Example 2
  • Definition 2: Logical Consistency on Primitive Operators
  • Example 3
  • Proposition 1
  • Example 4
  • Proposition 2
  • proof
  • proof
  • ...and 2 more