Benchmarking Large Language Models for Knowledge Graph Validation

Farzad Shami, Stefano Marchesin, Gianmaria Silvello

TL;DR

This paper tackles the challenge of validating facts in Knowledge Graphs (KGs) by introducing FactCheck, a benchmark that evaluates Large Language Models (LLMs) on KG fact validation across three dimensions: internal knowledge, external evidence via Retrieval-Augmented Generation (RAG), and multi-model consensus. The authors build an evaluation platform from three real-world KG datasets (FactBench, YAGO, DBpedia), a RAG corpus of more than 2 million documents, a mock API, and an interactive web explorer. Across open-source LLMs (Gemma2, Qwen2.5, Llama3.1, Mistral) and a commercial baseline (GPT-4o mini), the results show that LLMs are promising but not yet stable or reliable enough for real-world KG validation. RAG yields fluctuating, inconsistent improvements at higher computational cost, and multi-model consensus does not consistently outperform individual models. The work provides an extensible framework for reproducible evaluation of KG fact-validation methods and points to future research on prompting strategies, retrieval methods, and model architectures tailored to KG fact verification.

Abstract

Knowledge Graphs (KGs) store structured factual knowledge by linking entities through relationships, which is crucial for many applications. These applications depend on the KG's factual accuracy, so verifying facts is essential, yet challenging. Expert manual verification is ideal but impractical on a large scale. Automated methods show promise but are not ready for real-world KGs. Large Language Models (LLMs) offer potential with their semantic understanding and knowledge access, yet their suitability and effectiveness for KG fact validation remain largely unexplored. In this paper, we introduce FactCheck, a benchmark designed to evaluate LLMs for KG fact validation across three key dimensions: (1) LLMs' internal knowledge; (2) external evidence via Retrieval-Augmented Generation (RAG); and (3) aggregated knowledge employing a multi-model consensus strategy. We evaluated open-source and commercial LLMs on three diverse real-world KGs. FactCheck also includes a RAG dataset with 2+ million documents tailored for KG fact validation. Additionally, we offer an interactive exploration platform for analyzing verification decisions. The experimental analyses demonstrate that while LLMs yield promising results, they are still not sufficiently stable and reliable to be used in real-world KG validation scenarios. Integrating external evidence through RAG methods yields fluctuating performance, providing inconsistent improvements over more streamlined approaches, and it does so at higher computational costs. Similarly, strategies based on multi-model consensus do not consistently outperform individual models, underscoring the lack of a one-size-fits-all solution. These findings further emphasize the need for a benchmark like FactCheck to systematically evaluate and drive progress on this difficult yet crucial task.
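
To make the third dimension concrete, the following is a minimal sketch of one possible consensus rule: simple majority voting over per-model True/False verdicts for a single KG fact. The abstract does not say how FactCheck aggregates model outputs, so the function name `majority_vote`, the tie-breaking policy, and the sample verdicts are illustrative assumptions rather than the benchmark's actual method.

```python
from collections import Counter


def majority_vote(verdicts, tie_break=False):
    """Aggregate boolean verdicts from several models into one label.

    verdicts:  iterable of True/False, one per model (hypothetical input format).
    tie_break: label returned when True and False receive equal votes
               (an assumed policy, not taken from the paper).
    """
    counts = Counter(verdicts)
    true_votes, false_votes = counts[True], counts[False]
    if true_votes == false_votes:
        return tie_break
    return true_votes > false_votes


# Hypothetical verdicts for a single fact; the model names are placeholders.
verdicts = {"model_a": True, "model_b": False, "model_c": True}
consensus = majority_vote(verdicts.values())
print(consensus)
```

A rule like this can only be as good as the agreement pattern among the models: if most models share the same error on a fact, the vote inherits it, which is one plausible reading of why consensus does not consistently beat the best individual model.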

Paper Structure (28 sections, 3 equations, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Overview of the benchmark.
  • Figure 2: $F1$ scores on the benchmark. The left plot displays $F1(T)$ scores, and the right plot displays $F1(F)$ scores. Multi-model consensus results are shown with hatching, and the red dotted line indicates the guess rate.
  • Figure 3: Trade-off analysis between computational cost ($\bar{\theta}$) and verification performance ($F1(F)$ and $F1(T)$). The dashed line represents the Pareto frontier, highlighting configurations that achieve optimal efficiency (highest accuracy for a given time budget).
  • Figure 4: Intersection of correct predictions across models. Bars show the number of correct samples by the specific combination of models indicated by the connected dots below.
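
The captions for Figures 2 and 3 rely on two quantities that a short sketch can make concrete: a per-class F1 score (reading $F1(T)$ as F1 with "true" as the positive class and $F1(F)$ as F1 with "false" as the positive class) and a Pareto frontier over cost and score. The sketch below assumes that reading, uses the standard F1 definition, and treats cost as a single number per configuration. The function names and all sample data are hypothetical and are not results from the paper.

```python
def f1_for_class(y_true, y_pred, positive):
    """Standard F1 score, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def pareto_frontier(points):
    """Return names of configurations not dominated in (cost, score).

    points: list of (name, cost, score); lower cost and higher score are better.
    Exact duplicates of an already kept (cost, score) pair are dropped.
    """
    frontier = []
    best_score = float("-inf")
    # Visit cheapest first; among equal costs, highest score first.
    for name, cost, score in sorted(points, key=lambda p: (p[1], -p[2])):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


# Hypothetical gold labels and predictions for four facts.
y_true = [True, True, False, False]
y_pred = [True, False, False, False]
f1_t = f1_for_class(y_true, y_pred, positive=True)
f1_f = f1_for_class(y_true, y_pred, positive=False)

# Hypothetical (configuration, cost, score) triples.
configs = [("a", 1.0, 0.6), ("b", 2.0, 0.5), ("c", 3.0, 0.8)]
print(f1_t, f1_f, pareto_frontier(configs))
```

Reporting the two class-wise scores separately matters when the classes are imbalanced, since a model that labels nearly everything true can score well on $F1(T)$ while scoring poorly on $F1(F)$.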