Explainability for Large Language Models: A Survey

Haiyan Zhao; Hanjie Chen; Fan Yang; Ninghao Liu; Huiqi Deng; Hengyi Cai; Shuaiqiang Wang; Dawei Yin; Mengnan Du

TL;DR

This survey organizes explainability for Transformer-based LLMs around two training paradigms (traditional fine-tuning and prompting), covering local and global explanation methods, evaluation criteria, and practical uses for debugging and improving models. It presents a taxonomy of techniques (feature attribution, attention-based and example-based explanation, probing, and mechanistic interpretability) and discusses how explanations bear on model reliability, safety, and deployment. The authors also address evaluation challenges, emergent abilities, data-model interactions, and safety considerations, and outline directions for future research. Overall, the work shows how explainability can both illuminate model behavior and support safer, more trustworthy AI systems at scale.

Abstract

Large language models (LLMs) have demonstrated impressive capabilities in natural language processing. However, their internal mechanisms are still unclear, and this lack of transparency poses unwanted risks for downstream applications. Therefore, understanding and explaining these models is crucial for elucidating their behaviors, limitations, and social impacts. In this paper, we introduce a taxonomy of explainability techniques and provide a structured overview of methods for explaining Transformer-based language models. We categorize techniques based on the training paradigms of LLMs: the traditional fine-tuning-based paradigm and the prompting-based paradigm. For each paradigm, we summarize the goals and dominant approaches for generating local explanations of individual predictions and global explanations of overall model knowledge. We also discuss metrics for evaluating generated explanations and how explanations can be leveraged to debug models and improve performance. Lastly, we examine key challenges and emerging opportunities for explanation techniques in the era of LLMs in comparison to conventional machine learning models.

Paper Structure (60 sections, 5 figures)

Introduction
Training Paradigms of LLMs
Traditional Fine-Tuning Paradigm
Prompting Paradigm
Explanation for Traditional Fine-Tuning Paradigm
Local Explanation
Feature Attribution-Based Explanation
Perturbation-Based Explanation
Gradient-Based Explanation
Surrogate Models
Decomposition-Based Methods
Attention-Based Explanation
Visualizations
Function-Based Methods
Debate Over Attention
...and 45 more sections

Figures (5)

Figure 1: We categorize LLM explainability into two major paradigms. Based on this categorization, we summarize different kinds of explainability techniques associated with LLMs belonging to these two paradigms. We also discuss evaluations for the generated explanations under the two paradigms.
Figure 2: LLMs undergo unsupervised pre-training with random initialization to create a base model. The base model can then be fine-tuned through instruction tuning and RLHF to produce the assistant model.
Figure 3: Local explanation comprises four subareas. The figure shows how each subarea is organized and gives examples of individual explanation methods. (a) Bipartite graph attention representation of the attention matrix between sentence A and sentence B at the 6th layer [vig_bertviz_2019]; (b) Perturbing the question by deleting "did" increases the model's confidence in the answer "Colorado Springs experiments", even though the reduced question makes no sense to a human [feng_pathologies_2018]; (c) Shapley values for Transformer-based language models [chen_algorithms_2023]; (d) Explanations of the important components of the input text, used to assist commonsense reasoning [rajani_explain_2019]; (e) Negative examples of the input text, used to test the model's sentiment prediction and also usable to improve model performance [wu_polyjuice_2021]; (f) Changes to the input text that are imperceptible to humans but shift the classification away from the original [jin2019bert].
Figure 4: Bipartite graph attention representation and heatmap for attention matrix.
Figure 5: Activation visualization of the 131st neuron in the 5th layer of GPT-2. The simulated explanation from GPT-4 indicates that this neuron is activated by citations. The real activation of the neuron confirms the accuracy of the simulated explanation provided by GPT-4.
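
Panel (b) of Figure 3 describes a perturbation-based explanation, where a word is deleted and the change in model confidence is observed. A minimal sketch of that idea, in its leave-one-out form, is given below. It is an illustration under stated assumptions rather than the method of any cited paper: `toy_sentiment_score` is a hypothetical stand-in for a real model's confidence, and its word weights are invented for the example.

```python
from typing import Callable, List, Tuple


def toy_sentiment_score(tokens: List[str]) -> float:
    """Hypothetical stand-in for a model's confidence in the 'positive' class.

    The weights are made up for illustration; a real setup would call a
    fine-tuned classifier here instead.
    """
    weights = {"great": 0.4, "good": 0.3, "boring": -0.4, "not": -0.2}
    raw = 0.5 + sum(weights.get(tok.lower(), 0.0) for tok in tokens)
    return max(0.0, min(1.0, raw))  # clamp to [0, 1] like a probability


def leave_one_out(
    tokens: List[str], score_fn: Callable[[List[str]], float]
) -> List[Tuple[str, float]]:
    """Attribute importance to each token as the drop in score when it is removed."""
    base = score_fn(tokens)
    attributions = []
    for i, tok in enumerate(tokens):
        reduced = tokens[:i] + tokens[i + 1:]  # input with token i deleted
        attributions.append((tok, base - score_fn(reduced)))
    return attributions


tokens = "the movie was great but boring".split()
for tok, importance in leave_one_out(tokens, toy_sentiment_score):
    print(f"{tok:>8s}  {importance:+.2f}")
```

A positive value means that removing the token lowers the score, so the token supported the prediction; a negative value means the token worked against it. Each token requires one extra model call, which is one reason cheaper approximations are of interest for long inputs.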
