KScope: A Framework for Characterizing the Knowledge Status of Language Models

Yuxin Xiao, Shan Chen, Jack Gallifant, Danielle Bitterman, Thomas Hartvigsen, Marzyeh Ghassemi

TL;DR

The paper introduces a five-status taxonomy for LLM knowledge (Consistent Correct, Conflicting Correct, Absent, Conflicting Wrong, and Consistent Wrong) and KScope, a hierarchical testing framework that characterizes an LLM's knowledge modes from empirical response distributions. It applies KScope to nine LLMs across four datasets, showing that relevant context substantially increases consistent correct knowledge and that context features related to difficulty, relevance, and familiarity drive successful knowledge updates; update success is lower under noisy retrieval and on open-ended questions. Key contributions include formalizing the knowledge-status definitions in terms of the size of the knowledge mode set, $|\mathcal{Y}_p|$, and whether the correct answer belongs to it, $y^* \in \mathcal{Y}_p$; developing a four-step statistical procedure (binomial and multinomial tests plus likelihood ratio refinement) to infer knowledge status; and identifying context augmentation strategies (notably constrained summarization with credibility) that yield an average $4.3\%$ improvement in update success across statuses and models. The framework provides a practical, generalizable approach to diagnosing and improving knowledge updates in retrieval-augmented generation systems and other LLM pipelines, with implications for reliability and safety in high-stakes domains.
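
The five statuses can be read as a small decision rule over the knowledge mode set $\mathcal{Y}_p$ and the correct answer $y^*$. The sketch below is illustrative only. It assumes that absent knowledge corresponds to an empty mode set, that "consistent" means exactly one mode, and that "conflicting" means more than one; the function name `knowledge_status` and the string labels are hypothetical and not taken from the paper.

```python
from typing import FrozenSet


def knowledge_status(modes: FrozenSet[str], correct: str) -> str:
    """Map a knowledge mode set and the correct answer to one of five statuses.

    Assumption (not confirmed by the source text): an empty mode set means
    the model shows no preferred answer, i.e. absent knowledge.
    """
    if len(modes) == 0:
        return "absent"
    # Consistency depends only on how many modes there are.
    consistency = "consistent" if len(modes) == 1 else "conflicting"
    # Correctness depends only on whether the correct answer is among them.
    correctness = "correct" if correct in modes else "wrong"
    return f"{consistency} {correctness}"


if __name__ == "__main__":
    for modes in [{"A"}, {"A", "B"}, set(), {"B", "C"}, {"B"}]:
        print(sorted(modes), "->", knowledge_status(frozenset(modes), "A"))
```

Keeping consistency and correctness as two independent checks mirrors how the taxonomy is described: one axis is $|\mathcal{Y}_p|$, the other is whether $y^* \in \mathcal{Y}_p$.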

Abstract

Characterizing a large language model's (LLM's) knowledge of a given question is challenging. As a result, prior work has primarily examined LLM behavior under knowledge conflicts, where the model's internal parametric memory contradicts information in the external context. However, this does not fully reflect how well the model knows the answer to the question. In this paper, we first introduce a taxonomy of five knowledge statuses based on the consistency and correctness of LLM knowledge modes. We then propose KScope, a hierarchical framework of statistical tests that progressively refines hypotheses about knowledge modes and characterizes LLM knowledge into one of these five statuses. We apply KScope to nine LLMs across four datasets and systematically establish: (1) Supporting context narrows knowledge gaps across models. (2) Context features related to difficulty, relevance, and familiarity drive successful knowledge updates. (3) LLMs exhibit similar feature preferences when partially correct or conflicted, but diverge sharply when consistently wrong. (4) Context summarization constrained by our feature analysis, together with enhanced credibility, further improves update effectiveness and generalizes across LLMs.
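
To make the idea of progressively refining hypotheses about knowledge modes concrete, here is a deliberately simplified two-stage sketch for a multiple-choice question. It is not the KScope procedure: it swaps in exact one-sided binomial tests with a Bonferroni correction and omits the multinomial and likelihood ratio steps. The names `binom_upper_tail` and `infer_modes`, the significance level `alpha`, and the restriction to at most two modes are all assumptions made for illustration.

```python
from collections import Counter
from math import comb
from typing import List, Set


def binom_upper_tail(x: int, n: int, p: float) -> float:
    """Exact P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(x, n + 1))


def infer_modes(responses: List[str], options: List[str], alpha: float = 0.05) -> Set[str]:
    """Estimate the set of preferred answers from repeated sampled responses.

    Stage 1: is the most frequent option chosen more often than uniform
             guessing would explain? If not, return an empty set.
    Stage 2: does the top option clearly beat the runner-up? If so, return
             one mode; otherwise return the top two options as modes.
    """
    n, k = len(responses), len(options)
    counts = Counter(responses)
    ranked = sorted(options, key=lambda o: counts[o], reverse=True)
    top, second = ranked[0], ranked[1]

    # Stage 1: Bonferroni-corrected test of the top option against p = 1/k.
    p_uniform = min(1.0, k * binom_upper_tail(counts[top], n, 1.0 / k))
    if p_uniform >= alpha:
        return set()

    # Stage 2: among responses that picked top or second, test p = 0.5.
    pair_total = counts[top] + counts[second]
    p_tie = binom_upper_tail(counts[top], pair_total, 0.5)
    return {top} if p_tie < alpha else {top, second}


if __name__ == "__main__":
    opts = ["A", "B", "C"]
    print(infer_modes(["A"] * 90 + ["B"] * 6 + ["C"] * 4, opts))    # {'A'}
    print(infer_modes(["A"] * 34 + ["B"] * 33 + ["C"] * 33, opts))  # set()
```

The resulting mode set, together with the correct answer, would then determine which of the five statuses applies. A fuller treatment would repeat the refinement for larger support sets, as the paper notes for its own Step 3.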

Paper Structure

This paper contains 22 sections, 23 figures.

Figures (23)

  • Figure 1: We propose a taxonomy of five knowledge statuses based on the consistency and correctness of LLM knowledge modes. We illustrate the taxonomy using an LLM's parametric knowledge modes $\mathcal{Y}_p$ in a three-option classification task. This formulation also applies to contextual knowledge modes $\mathcal{Y}_q$ and generalizes to open-ended questions or classification tasks with more options.
  • Figure 2: We propose KScope, a hierarchical testing framework to characterize LLM knowledge into one of the five identified statuses. We note that our framework generalizes to questions with larger support sets by repeating Step 3 to iteratively refine hypotheses about the knowledge mode set.
  • Figure 3: Characterization results from applying the KScope framework to nine LLMs across four datasets. Overall, most LLMs exhibit the highest proportion of consistent correct parametric knowledge status, which further increases when gold context is provided.
  • Figure 4: Context-induced shifts in knowledge status distributions. Supporting context increases the proportion of consistent correct knowledge across all datasets and models. The Llama family and larger models within each family achieve higher proportions of consistent correct knowledge, although the gaps narrow with context.
  • Figure 5: Characterization results from applying KScope to nine LLMs on HotpotQA across different settings. Compared to (b), where gold context in the multi-choice setting enables more consistent correct knowledge, (c) noisy context and (e) the open-ended setting yield lower update success. Without context, the Gemma family shows more absent knowledge in (d) the open-ended setting than in (a) the multi-choice setting, whereas the Llama and Qwen families mostly show the opposite trend.
  • ...and 18 more figures