GraphCheck: Breaking Long-Term Text Barriers with Extracted Knowledge Graph-Powered Fact-Checking

Yingjian Chen, Haoran Liu, Yinhong Liu, Jinxiang Xie, Rui Yang, Han Yuan, Yanran Fu, Peng Yuan Zhou, Qingyu Chen, James Caverlee, Irene Li

TL;DR

GraphCheck tackles factual errors in long-form LLM outputs by augmenting inputs with knowledge graphs (KGs) extracted from both the claim and its grounding document. A trainable GNN encodes these graphs and produces embeddings that are projected into the LLM's embedding space, so a frozen LLM can verify the claim in a single, graph-informed inference step. The approach reaches 71.1% balanced accuracy across seven benchmarks spanning general and medical domains, outperforming several specialized fact-checkers while matching large LLMs at a fraction of the cost. It also improves explainability through KG-edge attention visualizations and introduces a synthetic KG-enhanced dataset for future graph-based fact-checking research. Overall, GraphCheck offers a scalable, efficient, and interpretable path to reliable long-text fact-checking, with practical implications for high-stakes domains.
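The data flow described above (triples, then a GNN, then a projection into the LLM's embedding space) can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration: the toy triples, the dimensions `GNN_DIM` and `LLM_DIM`, the random node features and weights, and the choice of one round of mean aggregation with mean pooling. The summary does not specify the GNN architecture, and in a real system the node features would come from a text encoder and the projection would be trained.

```python
import random

def build_graph(triples):
    """Assign node ids to entities and collect edges from (head, relation, tail) triples."""
    index = {}
    edges = []
    for head, _relation, tail in triples:
        for entity in (head, tail):
            index.setdefault(entity, len(index))
        edges.append((index[head], index[tail]))
    return index, edges

def message_pass(features, edges):
    """One round of mean aggregation over each node and its neighbours (assumed GNN layer)."""
    neighbours = {i: [i] for i in range(len(features))}  # include a self-loop
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    dim = len(features[0])
    return [
        [sum(features[j][d] for j in neighbours[i]) / len(neighbours[i]) for d in range(dim)]
        for i in range(len(features))
    ]

def mean_pool(features):
    """Average node vectors into one graph-level vector."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]

def project(vector, weight):
    """Linear map from GNN space to LLM space; weight has shape (llm_dim, gnn_dim)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

def graph_soft_token(triples, gnn_dim, weight, rng):
    """Encode one KG as a single vector in the LLM's embedding space."""
    index, edges = build_graph(triples)
    # Random features stand in for entity text embeddings (assumption).
    features = [[rng.gauss(0, 1) for _ in range(gnn_dim)] for _ in index]
    return project(mean_pool(message_pass(features, edges)), weight)

rng = random.Random(0)
GNN_DIM, LLM_DIM = 4, 8  # hypothetical sizes
weight = [[rng.gauss(0, 0.1) for _ in range(GNN_DIM)] for _ in range(LLM_DIM)]

claim_triples = [("drug_a", "treats", "condition_b")]
doc_triples = [("drug_a", "inhibits", "enzyme_c"), ("enzyme_c", "drives", "condition_b")]

# One soft token per graph, prepended to the (placeholder) text token embeddings.
soft_prompt = [
    graph_soft_token(claim_triples, GNN_DIM, weight, rng),
    graph_soft_token(doc_triples, GNN_DIM, weight, rng),
]
text_embeddings = [[0.0] * LLM_DIM for _ in range(5)]
llm_input = soft_prompt + text_embeddings
```

The point of the sketch is the shape of the interface: the graph vectors live in the same space as the token embeddings, so the LLM itself can stay frozen while only the graph encoder and projection are trained.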

Abstract

Large language models (LLMs) are widely used, but they often generate subtle factual errors, especially in long-form text. Such errors can be fatal in specialized domains such as medicine. Existing methods for fact-checking against grounding documents face two main challenges: (1) they struggle to understand complex multihop relations in long documents and often overlook subtle factual errors; (2) most specialized methods rely on pairwise comparisons that require multiple model calls, leading to high resource and computational costs. To address these challenges, we propose GraphCheck, a fact-checking framework that uses extracted knowledge graphs to enhance text representation. Graph Neural Networks further process these graphs into a soft prompt, enabling LLMs to incorporate structured knowledge more effectively. Enhanced with graph-based reasoning, GraphCheck captures multihop reasoning chains that existing methods often overlook, enabling precise and efficient fact-checking in a single inference call. Experimental results on seven benchmarks spanning both general and medical domains demonstrate up to a 7.1% overall improvement over baseline models. Notably, GraphCheck outperforms existing specialized fact-checkers and achieves performance comparable to state-of-the-art LLMs, such as DeepSeek-V3 and OpenAI-o1, with significantly fewer parameters.
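To make "multihop reasoning chain" concrete, the sketch below searches a small set of extracted triples for a chain linking two entities. It is only an illustration of what such a chain looks like in a KG: GraphCheck itself is described as encoding the graphs with a GNN rather than running an explicit symbolic search, and the function name `find_chain`, the `max_hops` limit, and the toy triples are all hypothetical.

```python
from collections import deque

def find_chain(triples, source, target, max_hops=3):
    """Breadth-first search for a chain of triples that links source to target."""
    outgoing = {}
    for head, relation, tail in triples:
        outgoing.setdefault(head, []).append((relation, tail))

    queue = deque([(source, [])])
    visited = {source}
    while queue:
        entity, path = queue.popleft()
        if entity == target and path:
            return path
        if len(path) == max_hops:
            continue  # do not expand beyond the hop limit
        for relation, tail in outgoing.get(entity, []):
            if tail not in visited:
                visited.add(tail)
                queue.append((tail, path + [(entity, relation, tail)]))
    return None

# Toy document triples (placeholders, not real medical facts).
doc_triples = [
    ("drug_a", "inhibits", "enzyme_c"),
    ("enzyme_c", "drives", "condition_b"),
    ("drug_a", "approved_in", "year_x"),
]

# A claim relating drug_a to condition_b is backed only by a two-hop chain.
chain = find_chain(doc_triples, "drug_a", "condition_b")
```

Here no single triple connects `drug_a` to `condition_b`; the link only appears when two triples are combined, which is the kind of relation a sentence-by-sentence comparison can miss.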

Paper Structure

This paper contains 26 sections, 3 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Comparison of fact-checking methods. Naive Check performs a single-pass evaluation but often misses detailed factual errors. Atomic Check ensures fine-grained verification by checking atomic facts individually but is inefficient due to multiple LLM calls. In contrast, our GraphCheck achieves fine-grained fact-checking in a single call, significantly improving efficiency while maintaining accuracy.
  • Figure 2: An illustration of the GraphCheck framework. First, an LLM extracts entity-relation triples from the claim and from the document to construct a KG for each. A GNN pre-trained with external text graph data is then used to obtain graph embeddings from both KGs. These graph embeddings, combined with the text embeddings, are fed into an LLM for final fact-checking. This approach enables the LLM to perform fine-grained fact-checking by leveraging key triples in the KG (highlighted) alongside the text information.
  • Figure 3: Average BAcc across general and medical domains. We compare our method with specialized fact-checking methods in the general domain (AggreFact-XSum, AggreFact-CNN, Summeval, ExpertQA) and the medical domain (COVID-Fact, PubHealth, SCIFact).
  • Figure 4: The BAcc of the base model and the proposed GraphCheck architecture across all seven benchmarks for Llama3 8B, Llama3.3 70B, Qwen2.5 7B, Qwen2.5 72B models. The blue-shaded region represents the base model performance, while the red-shaded region highlights the enhanced performance with GraphCheck.
  • Figure 5: Balanced accuracy comparison across different training data sizes on all benchmarks. The baseline model performance is marked at 0 on the $x$-axis.
  • ...and 6 more figures
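
The figure captions report balanced accuracy (BAcc). For a binary task, balanced accuracy is the mean of the recall on each class, which keeps a skewed label distribution from inflating the score. A minimal sketch follows, assuming labels are encoded as 1 (supported) and 0 (unsupported); that encoding and the toy predictions are illustrative assumptions, not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels 0 and 1."""
    recalls = []
    for label in (0, 1):
        total = sum(1 for t in y_true if t == label)
        if total == 0:
            raise ValueError(f"no examples with label {label}")
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)

# Imbalanced toy data: eight supported claims, two unsupported ones.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bacc = balanced_accuracy(y_true, y_pred)
```

In this toy case plain accuracy is 0.9 while balanced accuracy is 0.75, because one of the two unsupported claims was missed and that class counts for half of the score.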