Decoding Knowledge in Large Language Models: A Framework for Categorization and Comprehension
Yanbo Fang, Ruixiang Tang
TL;DR
The paper examines how large language models acquire, retain, and apply knowledge, looking beyond simple accuracy. It introduces the K-(CSA)^2 framework, which categorizes knowledge by correctness and confidence, and a Category Score metric defined as $\text{Category Score} = \sum_{i=1}^{6} w_i \cdot r_i$ with $w_i = 7 - i$. Experiments on the HaluEval dataset cover chain-of-thought (CoT) prompting, instruction tuning (IT), and RLHF. They show that higher layers encode more high-confidence knowledge, while uncertain knowledge sits in middle-to-lower layers, and that external (context-dependent) knowledge benefits from context differently than internal (pre-trained) knowledge does. Combining CoT with IT yields synergistic gains. The framework and metric provide quantitative tools to diagnose knowledge gaps, track transitions between knowledge categories, and guide targeted improvements in knowledge representation for real-world deployment.
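The Category Score formula can be illustrated with a short Python sketch. This is a minimal example under stated assumptions, not the authors' implementation: it assumes $r_i$ is the fraction of evaluated items falling into category $i$, with categories ordered from most to least desirable, so that $w_1 = 6$ and $w_6 = 1$. The function name `category_score` and the sample ratios are hypothetical.

```python
from typing import Sequence

NUM_CATEGORIES = 6  # the framework defines six knowledge categories


def category_score(ratios: Sequence[float]) -> float:
    """Weighted sum of per-category ratios with weights w_i = 7 - i.

    Assumption: ratios[0] is r_1 (most desirable category) and
    ratios[5] is r_6 (least desirable category).
    """
    if len(ratios) != NUM_CATEGORIES:
        raise ValueError(f"expected {NUM_CATEGORIES} ratios, got {len(ratios)}")
    # enumerate from 1 so that i matches the index used in the formula
    return sum((7 - i) * r for i, r in enumerate(ratios, start=1))


if __name__ == "__main__":
    # Hypothetical distribution of items over the six categories.
    example_ratios = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]
    print(f"Category Score: {category_score(example_ratios):.2f}")
```

Under this reading, the score ranges from 1 (all items in category 6) to 6 (all items in category 1) when the ratios sum to one, so a higher score means a larger share of knowledge in the more desirable categories.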
Abstract
Understanding how large language models (LLMs) acquire, retain, and apply knowledge remains an open challenge. This paper introduces a novel framework, K-(CSA)^2, which categorizes LLM knowledge along two dimensions: correctness and confidence. The framework defines six categories of knowledge, ranging from highly confident correctness to confidently held misconceptions, enabling a nuanced evaluation of model comprehension beyond binary accuracy. Using this framework, we demonstrate how techniques like chain-of-thought (CoT) prompting and reinforcement learning from human feedback (RLHF) fundamentally alter the structure of both internal (pre-trained) and external (context-dependent) knowledge in LLMs. CoT particularly enhances base model performance and shows synergistic benefits when applied to aligned LLMs. Moreover, our layer-wise analysis reveals that higher layers in LLMs encode more high-confidence knowledge, while low-confidence knowledge tends to emerge in middle-to-lower layers.
