How Reliable are LLMs as Knowledge Bases? Re-thinking Factuality and Consistency
Danna Zheng, Mirella Lapata, Jeff Z. Pan
TL;DR
This work argues that using LLMs as knowledge bases requires evaluating both factuality (accuracy on seen knowledge and abstention on unseen knowledge) and consistency (stable answers to questions about the same facts). It introduces UnseenQA to probe knowledge not present in LLM pretraining and defines the metrics NCR, UR, C_correct, C_wrong, NCCR, and IUR to quantify reliability. A large-scale study of 26 LLMs on SeenQA and UnseenQA reveals trade-offs: larger models are more consistent but often overconfident on unseen or incorrect content; fine-tuning helps unseen performance but hurts seen performance; and in-context learning (ICL) yields limited gains for seen knowledge. The paper advocates a principled reliability score and comprehensive evaluation to advance robust LLM-based KB systems.
Abstract
Large Language Models (LLMs) are increasingly explored as knowledge bases (KBs), yet current evaluation methods focus too narrowly on knowledge retention, overlooking other crucial criteria for reliable performance. In this work, we rethink the requirements for evaluating reliable LLM-as-KB usage and highlight two essential factors: factuality, ensuring accurate responses to seen and unseen knowledge, and consistency, maintaining stable answers to questions about the same knowledge. We introduce UnseenQA, a dataset designed to assess LLM performance on unseen knowledge, and propose new criteria and metrics to quantify factuality and consistency, leading to a final reliability score. Our experiments on 26 LLMs reveal several challenges regarding their use as KBs, underscoring the need for more principled and comprehensive evaluation.
