
Are LLMs Really Not Knowledgeable? Mining the Submerged Knowledge in LLMs' Memory

Xingjian Tao, Yiwei Wang, Yujun Cai, Zhicheng Yang, Jing Tang

TL;DR

This work reveals a fundamental gap between knowledge stored in LLM parameters and its surface expression in generated answers. By analyzing token-level logits and introducing the Hits@$k$ metric, the authors show that LLMs often retain substantial factual knowledge even when outputs are incorrect or uncertain. Newer models demonstrate higher latent knowledge, while larger models do not guarantee greater memory capture, with domain and popularity shaping memory storage. The study also demonstrates a memory-masking effect where cautious decoding and 'unsure' responses suppress correct knowledge, and proposes a two-stage decoding approach to recover hidden answers, offering practical guidance for prompting and decoding in knowledge-intensive tasks.

Abstract

Large language models (LLMs) have shown promise as parametric knowledge bases, but often underperform on question answering (QA) tasks due to hallucinations and uncertainty. While prior work attributes these failures to knowledge gaps in the model's parameters, we uncover a complementary phenomenon: LLMs frequently retain correct knowledge even when generating incorrect or "unsure" answers. By analyzing the token-level output distributions, we find that correct answers often appear among high-probability candidates, despite not being selected. Motivated by this, we propose Hits@k, a novel metric to evaluate latent knowledge retention independent of answer surface form. Our experiments reveal that LLMs possess significantly more factual knowledge than is reflected by standard QA accuracy. Building on these insights, we further examine the prevailing few-shot QA paradigm. We find that prompting strategies which allow "unsure" outputs can inadvertently suppress correct answers by discouraging low-confidence generation. We design a set of quantitative experiments to measure this suppression effect, offering practical guidance for future prompt and decoding design in knowledge-intensive tasks.
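
To make the idea behind Hits@k concrete, here is a minimal Python sketch. It is not the authors' implementation: the exact definition is not given in this summary, so the sketch assumes that a question counts as a hit when the first token of the gold answer ranks within the top k entries of the model's output logits at the answer position. The names `rank_of`, `hits_at_k`, `toy_logits`, and `toy_gold` are hypothetical, and the logits are made-up toy values over a four-token vocabulary.

```python
from typing import Sequence


def rank_of(token_id: int, logits: Sequence[float]) -> int:
    """Return the 1-based rank of token_id when logits are sorted descending."""
    target = logits[token_id]
    # Count how many vocabulary entries score strictly higher than the target.
    return 1 + sum(1 for value in logits if value > target)


def hits_at_k(
    gold_token_ids: Sequence[int],
    logits_per_question: Sequence[Sequence[float]],
    k: int,
) -> float:
    """Fraction of questions whose gold token ranks within the top k.

    Assumption (not from the source): one gold token and one logit vector
    per question, taken at the position where the answer begins.
    """
    if len(gold_token_ids) != len(logits_per_question):
        raise ValueError("need one logit vector per gold token")
    if not gold_token_ids:
        raise ValueError("need at least one question")
    hits = sum(
        1
        for gold, logits in zip(gold_token_ids, logits_per_question)
        if rank_of(gold, logits) <= k
    )
    return hits / len(gold_token_ids)


# Toy data: three questions, vocabulary of four tokens (illustrative values only).
toy_logits = [
    [2.0, 0.5, 1.5, -1.0],  # greedy decoding would pick token 0
    [0.1, 3.0, 0.2, 2.5],   # greedy decoding would pick token 1
    [1.0, 0.9, 0.8, 0.7],   # greedy decoding would pick token 0
]
toy_gold = [2, 1, 3]  # gold answer token for each question

for k in (1, 2, 4):
    print(f"Hits@{k} = {hits_at_k(toy_gold, toy_logits, k):.3f}")
```

In this toy setup, Hits@1 coincides with greedy-decoding accuracy, and larger k credits questions where the gold token is ranked highly but not selected, which is the gap between stored and expressed knowledge that the abstract describes.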

Paper Structure

This paper contains 36 sections, 2 equations, 8 figures, 5 tables, and 1 algorithm.

Figures (8)

  • Figure 1: An example illustrating a scenario where a model possesses potentially correct memories yet fails to provide the correct answer.
  • Figure 2: The Hits@$k$ scores of different large language models on the DBPedia-Head dataset when $k = 100$.
  • Figure 3: The ranking of LLMs based on Accuracy and Hits@$k$ on DBPedia-Head when $k = 100$.
  • Figure 4: The Hits@$k$ of LLaMA3-8b on the DBpedia dataset for different values of $k$.
  • Figure 5: The cumulative distribution of the ranks of Hits@$k$ in the QA task.
  • ...and 3 more figures