Towards Uncovering How Large Language Model Works: An Explainability Perspective
Haiyan Zhao, Fan Yang, Bo Shen, Himabindu Lakkaraju, Mengnan Du
TL;DR
By surveying mechanistic interpretability and representation engineering, the paper synthesizes how LLMs store knowledge in neurons, circuits, and representations, and how training dynamics like grokking and memorization shape generalization. It links these insights to practical gains through model editing, pruning, and alignment with human values. The work highlights progress and persistent challenges, including validating circuit-level explanations, scaling analyses to the full parameter space, and safety considerations for deployment. Overall, it provides a structured roadmap for understanding and safely deploying LLMs through explainability-driven interventions.
Abstract
Large language models (LLMs) have led to breakthroughs in language tasks, yet the internal mechanisms that enable their remarkable generalization and reasoning abilities remain opaque. This lack of transparency presents challenges such as hallucinations, toxicity, and misalignment with human values, hindering the safe and beneficial deployment of LLMs. This paper aims to uncover the mechanisms underlying LLM functionality through the lens of explainability. First, we review what mechanistic interpretability techniques reveal about how knowledge is architecturally composed within LLMs and encoded in their internal parameters. Then, we summarize how knowledge is embedded in LLM representations, drawing on probing techniques and representation engineering. Additionally, we investigate training dynamics from a mechanistic perspective to explain phenomena such as grokking and memorization. Lastly, we explore how the insights gained from these explanations can enhance LLM performance through model editing, improve efficiency through pruning, and better align LLMs with human values.
