KGLens: Towards Efficient and Effective Knowledge Probing of Large Language Models with Knowledge Graphs

Shangshang Zheng, He Bai, Yizhe Zhang, Yi Su, Xiaochuan Niu, Navdeep Jaitly

TL;DR

KGLens presents a Thompson-sampling-based framework to efficiently measure how well large language models align with domain-specific knowledge graphs. By attaching Beta-distributed error probabilities to KG edges (PKG), sampling edges with Thompson-inspired strategies, and generating graph-guided Yes/No and Wh-Questions via GPT-4, the approach focuses evaluation on the most informative edges. QA verification (QAV) distinguishes EASY and HARD modes and uses automated verification to quantify LLM factual alignment with the KG, achieving near-human accuracy (95.7%) in human evaluation across three Wikidata KG domains. The framework enables scalable, edge-level analysis of LLM knowledge with metrics like win rate, zero-sense rate, and all-sense rate, and supports comprehensive analysis by temporal and entity-group attributes, offering practical insights for reducing hallucinations and guiding model improvements.

Abstract

Large Language Models (LLMs) might hallucinate facts, while curated Knowledge Graphs (KGs) are typically factually reliable, especially for domain-specific knowledge. Measuring the alignment between KGs and LLMs can effectively probe the factualness of LLMs and identify their knowledge blind spots. However, verifying LLMs over extensive KGs can be expensive. In this paper, we present KGLens, a Thompson-sampling-inspired framework aimed at effectively and efficiently measuring the alignment between KGs and LLMs. KGLens features a graph-guided question generator for converting KGs into natural language, along with a carefully designed importance sampling strategy based on a parameterized KG structure to expedite KG traversal. Our simulation experiment compares the brute-force method with KGLens under six different sampling methods, demonstrating that our approach achieves superior probing efficiency. Leveraging KGLens, we conducted in-depth analyses of the factual accuracy of ten LLMs across three large domain-specific KGs from Wikidata, comprising over 19K edges, 700 relations, and 21K entities. Human evaluation results indicate that KGLens can assess LLMs with a level of accuracy nearly equivalent to that of human annotators, achieving 95.7% of the accuracy rate.
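
The probing loop described above can be illustrated with a minimal sketch. It assumes a Beta(1, 1) prior on each edge's error probability, a top-k selection over the sampled values, and a fixed number of rounds in place of a convergence check; the names `probe_edges` and `ask` are hypothetical and not taken from the paper, and `ask` stands in for the whole question-generation and answer-verification step.

```python
import random

def probe_edges(edges, ask, rounds=50, batch_size=2, seed=0):
    """Thompson-style probing sketch (illustrative, not the authors' code).

    edges: list of hashable edge identifiers, e.g. (subject, relation, object).
    ask:   hypothetical callback returning True if the LLM answered the
           question generated from that edge correctly.
    Returns the posterior mean error probability for every edge.
    """
    rng = random.Random(seed)
    # Assumed convention: alpha counts failures, beta counts successes,
    # both starting at 1 (a uniform Beta(1, 1) prior).
    params = {e: [1.0, 1.0] for e in edges}
    for _ in range(rounds):
        # Draw one error probability per edge from its current Beta posterior.
        draws = {e: rng.betavariate(a, b) for e, (a, b) in params.items()}
        # Assumption: probe the edges whose sampled error probability is highest.
        batch = sorted(edges, key=lambda e: draws[e], reverse=True)[:batch_size]
        for e in batch:
            if ask(e):
                params[e][1] += 1.0  # correct answer: one more success
            else:
                params[e][0] += 1.0  # wrong answer: one more failure
    return {e: a / (a + b) for e, (a, b) in params.items()}

if __name__ == "__main__":
    toy_edges = [("Q1", "P57", "Q2"), ("Q3", "P57", "Q4"), ("Q5", "P161", "Q6")]
    # Toy oracle: the model is always wrong about the first edge only.
    estimates = probe_edges(toy_edges, ask=lambda e: e != toy_edges[0], batch_size=1)
    for edge, theta in estimates.items():
        print(edge, round(theta, 3))
```

Because edges that keep failing are drawn with high sampled error probability, the loop tends to spend most of its query budget on them, which is the efficiency argument the abstract makes against brute-force traversal.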

Paper Structure (34 sections, 2 equations, 12 figures, 7 tables)


Figures (12)

  • Figure 1: KGLens Framework. KGLens starts from the PKG initialization, where each edge is augmented with a beta distribution. A batch of edges is then sampled based on the edge probability $\theta$. Questions are generated from these edges, and an LLM is examined on a question-answering task. The beta distributions of the PKG edges are then updated based on the QA results. This process is iterated until the running metrics have converged.
  • Figure 2: Parameterized KG. The edge color indicates the estimated deficiency of the LLM on the associated fact.
  • Figure 3: We measure the MSE between the ground truth $\theta$ and the estimated $\theta'$ across different sampling methods. The vertical line epoch-N shows the number of API requests required for the Brute Force method to complete N full iterations over all edges. The graph consists of 2789 edges in total.
  • Figure 4: Percentage of facts that LLMs always answered correctly or always answered incorrectly. Full results in Tab. \ref{tab:zero-sense-rate} and \ref{tab:all-sense-rate}.
  • Figure 5: Zero-sense rate across years. Full results in Fig. \ref{fig:movie-hard-years}.
  • ...and 7 more figures
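
The two rates referenced in the Figure 4 caption can be sketched as follows. This assumes that the zero-sense rate is the share of probed facts the model always answered incorrectly and the all-sense rate is the share it always answered correctly; that mapping is inferred from the caption and table labels, not stated in this excerpt, and the function name `sense_rates` is hypothetical.

```python
def sense_rates(outcomes):
    """Compute (zero_sense_rate, all_sense_rate) from repeated QA outcomes.

    outcomes: dict mapping a fact (edge) to a list of booleans, one per
              probe, where True means the LLM answered correctly.
    Facts with no recorded probes are ignored (an assumption of this sketch).
    """
    probed = [results for results in outcomes.values() if results]
    if not probed:
        return 0.0, 0.0
    # Never answered correctly across all probes of the fact.
    zero_sense = sum(1 for results in probed if not any(results))
    # Answered correctly on every probe of the fact.
    all_sense = sum(1 for results in probed if all(results))
    return zero_sense / len(probed), all_sense / len(probed)

if __name__ == "__main__":
    toy = {
        "fact_1": [True, True],
        "fact_2": [False, False],
        "fact_3": [True, False],
        "fact_4": [False],
    }
    print(sense_rates(toy))
```

Facts with mixed outcomes, such as `fact_3` in the toy data, count toward neither rate, so the two values need not sum to one.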