How Reliable are LLMs as Knowledge Bases? Re-thinking Factuality and Consistency

Danna Zheng, Mirella Lapata, Jeff Z. Pan

TL;DR

This work argues that using LLMs as knowledge bases requires evaluating both factuality (accuracy on seen knowledge and abstention on unseen knowledge) and consistency (stable answers for the same facts). It introduces UnseenQA to probe knowledge not present in LLM pretraining and defines metrics NCR, UR, C_correct, C_wrong, NCCR, and IUR to quantify reliability. Through a large-scale study of 26 LLMs on SeenQA and UnseenQA, it reveals trade-offs: larger models are more consistent but often overconfident on unseen or incorrect content; fine-tuning helps unseen performance but hurts seen performance; ICL yields limited gains for seen knowledge. The paper advocates a principled reliability score and comprehensive evaluation to advance robust LLM-based KB systems.

Abstract

Large Language Models (LLMs) are increasingly explored as knowledge bases (KBs), yet current evaluation methods focus too narrowly on knowledge retention, overlooking other crucial criteria for reliable performance. In this work, we rethink the requirements for evaluating reliable LLM-as-KB usage and highlight two essential factors: factuality, ensuring accurate responses to seen and unseen knowledge, and consistency, maintaining stable answers to questions about the same knowledge. We introduce UnseenQA, a dataset designed to assess LLM performance on unseen knowledge, and propose new criteria and metrics to quantify factuality and consistency, leading to a final reliability score. Our experiments on 26 LLMs reveal several challenges regarding their use as KBs, underscoring the need for more principled and comprehensive evaluation.

Paper Structure

This paper contains 43 sections, 10 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: (a) An example illustrating three answer types: correct, wrong, and uninformative. Focusing only on the correct rate incorrectly suggests that Model B is better, even though Model A is more reliable, with a similar correct rate and a much lower wrong rate. (b) Illustration of LLM inconsistency with davinci-002 (temperature set to 0). The questions at the top concern seen knowledge, where the probability mass is concentrated on one prediction. The questions at the bottom concern unseen knowledge, where the distribution is more even. Drawing from such a distribution inevitably leads to inconsistencies. (c) Example computation of the consistency score $Cons(q, r)$. The LLM's original answer is shown in green, while distractors are shown in red. (d) An example illustrating how to evaluate LLM-as-KB.
  • Figure 2: Top 15 LLMs ranked by (a) Net Correct Rate, (b) Uninformative Rate, (c) $C_{\textit{correct}}$, (d) $C_{\textit{wrong}}$, (e) Net Consistent Correct Rate, (f) Inconsistent/Uninformative Rate. Values are scaled by 100 (full results are in the appendix).
  • Figure 3: The impact of (a) model size, (b) fine-tuning, (c) ICL on LLM performance, measured with NCR, UR, $C_{\textit{correct}}$, $C_{\textit{wrong}}$, NCCR, and IUR. Different metrics are color-coded. See the appendix for more detailed visualization. Values are scaled by 100.
  • Figure 4: The impact of question type on LLM performance, measured by Uninformative Rate (UR) on unseen knowledge (scaled by 100). Questions are grouped by answer type. Higher values have darker shades.
  • Figure 5: Distribution of uninformative responses given by LLMs to questions about unseen knowledge. We report results for the llama3-8b, gemma-7b, and their fine-tuned models (fourth column) but observe similar trends on other models (omitted for the sake of brevity).
  • ...and 7 more figures
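
The point made in the Figure 1(a) caption, that the correct rate alone can favor the less reliable model, can be illustrated with a small sketch. The answer labels and both models' counts below are made up for illustration, and the "net" score is assumed here to be the correct rate minus the wrong rate; this page does not state the paper's exact metric definitions, so treat the formula as an assumption rather than the authors' implementation.

```python
from collections import Counter

ANSWER_TYPES = ("correct", "wrong", "uninformative")


def answer_rates(labels):
    """Return the share of each answer type in a list of labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {kind: counts[kind] / len(labels) for kind in ANSWER_TYPES}


def net_correct_rate(labels):
    """Assumed definition: (number correct - number wrong) / total answers."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return (counts["correct"] - counts["wrong"]) / len(labels)


# Hypothetical outcomes on ten questions (not taken from the paper).
# Model A abstains when unsure; Model B always answers.
model_a = ["correct"] * 5 + ["wrong"] * 1 + ["uninformative"] * 4
model_b = ["correct"] * 6 + ["wrong"] * 4

for name, labels in (("A", model_a), ("B", model_b)):
    rates = answer_rates(labels)
    print(
        f"Model {name}: correct={rates['correct']:.2f} "
        f"wrong={rates['wrong']:.2f} "
        f"uninformative={rates['uninformative']:.2f} "
        f"net={net_correct_rate(labels):.2f}"
    )
```

Under these made-up numbers, Model B has the higher correct rate, but Model A comes out ahead once wrong answers are penalized, which mirrors the ranking reversal the caption describes.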