Towards Uncovering How Large Language Model Works: An Explainability Perspective

Haiyan Zhao, Fan Yang, Bo Shen, Himabindu Lakkaraju, Mengnan Du

TL;DR

By surveying mechanistic interpretability and representation engineering, the paper synthesizes how LLMs store knowledge in neurons, circuits, and representations, and how training dynamics like grokking and memorization shape generalization. It links these insights to practical gains through model editing, pruning, and alignment with human values. The work highlights progress and persistent challenges, including validating circuit-level explanations, scaling analyses to the full parameter space, and safety considerations for deployment. Overall, it provides a structured roadmap for understanding and safely deploying LLMs through explainability-driven interventions.

Abstract

Large language models (LLMs) have led to breakthroughs in language tasks, yet the internal mechanisms that enable their remarkable generalization and reasoning abilities remain opaque. This lack of transparency presents challenges such as hallucinations, toxicity, and misalignment with human values, hindering the safe and beneficial deployment of LLMs. This paper aims to uncover the mechanisms underlying LLM functionality through the lens of explainability. First, we review how knowledge is architecturally composed within LLMs and encoded in their internal parameters via mechanistic interpretability techniques. Then, we summarize how knowledge is embedded in LLM representations by leveraging probing techniques and representation engineering. Additionally, we investigate the training dynamics through a mechanistic perspective to explain phenomena such as grokking and memorization. Lastly, we explore how the insights gained from these explanations can enhance LLM performance through model editing, improve efficiency through pruning, and better align with human values.

Paper Structure (31 sections, 2 figures)

Figures (2)

  • Figure 1: In this work, we review existing progress on how LLMs work, including: a) how knowledge is architecturally composed within model components; b) what knowledge is encoded in intermediate representations; and c) how generalization abilities are achieved during the training process.
  • Figure 2: An illustration of a Transformer circuit, which is a key concept in mechanistic interpretability.