Decoding Knowledge in Large Language Models: A Framework for Categorization and Comprehension

Yanbo Fang, Ruixiang Tang

TL;DR

The paper examines how large language models acquire, retain, and apply knowledge, looking beyond simple accuracy. It introduces the K-(CSA)^2 framework, which categorizes knowledge by correctness and confidence, together with a Category Score metric defined as $\text{Category Score} = \sum_{i=1}^{6} w_i \cdot r_i$ with $w_i = 7 - i$, where $r_i$ is the ratio of knowledge points in category $i$; the six categories are therefore weighted from 6 (Highly Known) down to 1 (Confident Unknown). Through experiments on the HaluEval dataset covering chain-of-thought (CoT) prompting and training stages such as instruction tuning (IT) and RLHF, it finds that higher layers encode more high-confidence knowledge while uncertain knowledge sits in middle-to-lower layers, and that external (context-dependent) knowledge benefits from these techniques differently than internal knowledge does. The results show synergistic gains when CoT is combined with IT, and the framework provides quantitative tools to diagnose knowledge gaps, track category transitions, and guide targeted improvements in knowledge representation for scalable, real-world deployment.
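To make the Category Score concrete, here is a minimal Python sketch. It assumes each knowledge point carries exactly one category label, that $r_i$ is the fraction of points in category $i$, and that categories are indexed HK, MK, WK, UU, MU, CU (the order suggested by Figure 1 and by the "1.HK + 2.MK" labels in Figure 3). The helper names `category_ratios`, `category_score`, and `combined_accuracy` are hypothetical, not from the paper.

```python
from collections import Counter
from typing import Dict, Sequence

# Category order assumed from Figure 1 and Figure 3 (1.HK, 2.MK, ..., 6.CU).
CATEGORIES = ("HK", "MK", "WK", "UU", "MU", "CU")


def category_ratios(labels: Sequence[str]) -> Dict[str, float]:
    """Fraction of knowledge points in each category (assumed meaning of r_i)."""
    if not labels:
        raise ValueError("need at least one labelled knowledge point")
    counts = Counter(labels)
    unexpected = set(counts) - set(CATEGORIES)
    if unexpected:
        raise ValueError(f"unexpected category labels: {sorted(unexpected)}")
    return {c: counts[c] / len(labels) for c in CATEGORIES}


def category_score(ratios: Dict[str, float]) -> float:
    """Category Score = sum_{i=1}^{6} w_i * r_i with w_i = 7 - i (HK weighs 6, CU weighs 1)."""
    return sum((7 - i) * ratios[c] for i, c in enumerate(CATEGORIES, start=1))


def combined_accuracy(ratios: Dict[str, float]) -> float:
    """Combined accuracy as used to sort Figure 3: r_HK + r_MK."""
    return ratios["HK"] + ratios["MK"]


r = category_ratios(["HK", "HK", "MK", "CU"])
print(category_score(r))     # 4.5
print(combined_accuracy(r))  # 0.75
```

Because the ratios sum to 1 under this assumption, the score ranges from 1 (everything Confident Unknown) to 6 (everything Highly Known), so unlike plain accuracy it also rewards moving wrong answers from confident to less confident categories.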

Abstract

Understanding how large language models (LLMs) acquire, retain, and apply knowledge remains an open challenge. This paper introduces a novel framework, K-(CSA)^2, which categorizes LLM knowledge along two dimensions: correctness and confidence. The framework defines six categories of knowledge, ranging from highly confident correctness to confidently held misconceptions, enabling a more nuanced evaluation of model comprehension than binary accuracy. Using this framework, we demonstrate how techniques such as chain-of-thought (CoT) prompting and reinforcement learning from human feedback (RLHF) fundamentally alter the structure of both internal (pre-trained) and external (context-dependent) knowledge in LLMs. CoT particularly enhances base model performance and shows synergistic benefits when applied to aligned LLMs. Moreover, our layer-wise analysis reveals that higher layers in LLMs encode more high-confidence knowledge, while low-confidence knowledge tends to emerge in middle-to-lower layers.
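The abstract and Figure 1 describe the six categories only at a high level (greedy decoding for deterministic answers, random sampling for variability, and separate correctness and confidence probabilities). The sketch below is one plausible, hypothetical assignment rule for a single question; it is not the paper's exact procedure. Its assumptions: several greedy answers per question (for example from different prompts), several temperature-sampled answers, exact-match correctness after simple normalization, confidence proxied by how often the sampled answers agree with their most common answer, and made-up thresholds `high_conf` and `low_conf`.

```python
from collections import Counter
from typing import Sequence


def _normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (assumed exact-match correctness check)."""
    return " ".join(answer.lower().split())


def categorize(
    greedy_answers: Sequence[str],
    sampled_answers: Sequence[str],
    gold: str,
    high_conf: float = 0.8,  # hypothetical threshold separating CU from MU
    low_conf: float = 0.4,   # hypothetical threshold separating MU from UU
) -> str:
    """Map one question to a K-(CSA)^2-style category (illustrative rule, not the paper's)."""
    if not greedy_answers or not sampled_answers:
        raise ValueError("need at least one greedy and one sampled answer")
    target = _normalize(gold)
    # Share of greedy (T = 0) answers that are correct, e.g. across different prompts.
    p_greedy = sum(_normalize(a) == target for a in greedy_answers) / len(greedy_answers)
    if p_greedy == 1.0:
        return "HK"  # always correct under greedy decoding
    if p_greedy > 0.0:
        return "MK"  # sometimes correct under greedy decoding
    if any(_normalize(a) == target for a in sampled_answers):
        return "WK"  # only sampling (T > 0) ever recovers the answer
    # Never correct: split by confidence, proxied here by answer consistency.
    modal_count = Counter(_normalize(a) for a in sampled_answers).most_common(1)[0][1]
    confidence = modal_count / len(sampled_answers)
    if confidence >= high_conf:
        return "CU"  # consistently repeats the same wrong answer
    if confidence >= low_conf:
        return "MU"
    return "UU"  # scattered wrong answers


print(categorize(["Paris"], ["Paris", "Lyon"], "paris"))  # HK
print(categorize(["Lyon"], ["Lyon"] * 5, "Paris"))        # CU
```

Using answer consistency as the confidence signal keeps the sketch model-agnostic; a token-probability-based confidence would be an equally reasonable substitute under different assumptions.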

Paper Structure

This paper contains 20 sections, 2 equations, 14 figures, and 3 tables.

Figures (14)

  • Figure 1: Illustration of our framework, K-(CSA)$^2$, for categorizing knowledge comprehension in LLMs. The framework separates model responses into six categories: three for known knowledge (Highly Known (HK), Maybe Known (MK), Weakly Known (WK)) and three for unknown knowledge (Unconfident Unknown (UU), Mayconfident Unknown (MU), Confident Unknown (CU)). Greedy decoding yields deterministic answers, while random sampling introduces variability. Each response is mapped to a category according to both its correctness and the model's confidence in it. $M$ denotes the language model and $T$ the sampling temperature; $q$ is the input question and $a$ is the LLM's answer to $q$. The correctness probability is denoted by $P_{\textrm{Correctness}}$ and the confidence level by $P_{\textrm{Confidence}}$.
  • Figure 2: Comparative analysis of model internal knowledge performance across different variations. Each subplot shows performance differences between model variants, measuring changes in both accuracy (x-axis) and category score (y-axis). From left to right: (1) IT versus base models, showing a general decrease in performance; (2) CoT versus base models, indicating moderate improvements; (3) IT+CoT versus base models, revealing substantial gains; (4) IT+CoT versus IT, highlighting the additive benefits of CoT; and (5) IT+CoT versus CoT, showing the complementary effects of combining both techniques. Each point represents a model, and its position indicates the magnitude and direction of the performance change. Positive values on both axes indicate improvement over the comparison baseline, while negative values indicate degradation. Base: base model; IT: instruction-tuned model; CoT: chain-of-thought.
  • Figure 3: Internal knowledge category structure across models, sorted left to right by increasing combined accuracy (the summed ratios of the top two categories, 1.HK + 2.MK). The values within each section represent the ratio of knowledge points belonging to each category at different training steps. There are six knowledge categories, each represented by a different color.
  • Figure 4: Average transition patterns across all evaluated models, comparing how CoT and IT+CoT affect category transitions relative to the base model. Results are shown separately for (a) internal knowledge and (b) external knowledge, with each bar representing the mean stable, downgrade, and upgrade ratios for each knowledge category.
  • Figure 5: Mean category transition patterns for individual models, showing averaged effects of CoT and IT+CoT on (a) internal and (b) external knowledge performance relative to base models.
  • ...and 9 more figures
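Figures 4 and 5 report, for each knowledge category, the share of knowledge points that stay in place, move up, or move down when CoT or IT+CoT is applied relative to the base model. A minimal sketch of how such ratios could be computed, assuming each knowledge point has one category label under the base model and one under the variant, and that an "upgrade" means moving to a category earlier in the HK-to-CU order used by the Category Score. The function name `transition_ratios` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence

# Assumed quality order: index 0 (HK) is best, index 5 (CU) is worst.
CATEGORIES = ("HK", "MK", "WK", "UU", "MU", "CU")
RANK = {c: i for i, c in enumerate(CATEGORIES)}


def transition_ratios(before: Sequence[str], after: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Per original category, the fraction of points that are stable, upgraded, or downgraded."""
    if len(before) != len(after):
        raise ValueError("before and after must label the same knowledge points")
    counts = defaultdict(lambda: {"stable": 0, "upgrade": 0, "downgrade": 0})
    for b, a in zip(before, after):
        # RANK[...] raises KeyError for labels outside the six categories.
        if RANK[a] == RANK[b]:
            key = "stable"
        elif RANK[a] < RANK[b]:
            key = "upgrade"  # moved toward HK
        else:
            key = "downgrade"  # moved toward CU
        counts[b][key] += 1
    return {cat: {k: v / sum(c.values()) for k, v in c.items()} for cat, c in counts.items()}


base = ["HK", "HK", "MK", "CU"]
with_cot = ["HK", "MK", "HK", "WK"]
print(transition_ratios(base, with_cot)["HK"])  # {'stable': 0.5, 'upgrade': 0.0, 'downgrade': 0.5}
```

Averaging these per-category ratios over several models would give means in the spirit of Figure 4; how the paper weights individual models in that average is not stated in this summary.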