From Understanding to Utilization: A Survey on Explainability for Large Language Models

Haoyan Luo; Lucia Specia

TL;DR

The paper surveys explainability for large language models, focusing on pre-trained Transformer LLMs and the challenges of transparency in these black-box systems. It organizes explanations into local (per-instance attributions and transformer-sublayer analyses) and global (probing and mechanistic interpretability) perspectives, and links them to practical uses such as model editing, long-text handling, and controllable generation. Key contributions include a synthesis of methods, evaluation approaches, and datasets for plausibility and truthfulness, along with guidance for future work toward trustworthy alignment and responsible deployment. By grounding explanations in concrete model components like $L$ transformer layers and $d$-dimensional hidden states, the survey provides a concrete framework to advance explainability in the LLM era.

Abstract

Explainability for Large Language Models (LLMs) is a critical yet challenging aspect of natural language processing. As LLMs become increasingly integral to diverse applications, their "black-box" nature raises significant concerns regarding transparency and ethical use. This survey underscores the need for greater explainability in LLMs, covering both research on explainability itself and the methodologies and tasks that make use of an understanding of these models. Our focus is primarily on pre-trained Transformer-based LLMs, such as the LLaMA family, which pose distinctive interpretability challenges due to their scale and complexity. We classify existing methods into local and global analyses, based on their explanatory objectives. For the utilization of explainability, we explore several compelling methods that concentrate on model editing, controllable generation, and model enhancement. Additionally, we examine representative evaluation metrics and datasets, elucidating their advantages and limitations. Our goal is to reconcile theoretical and empirical understanding with practical implementation, proposing promising avenues for explanatory techniques and their applications in the LLM era.

Paper Structure (34 sections, 3 equations, 4 figures)

Introduction
Overview
Categorization of Methods
Explainability for Large Language Models
Local Analysis
Feature Attribution Explanation
Perturbation-Based Methods
Gradient-Based Methods
Vector-Based Methods
Dissecting Transformer Blocks
Analyzing MHSA Sublayers
Analyzing MLP Sublayers
Global Analysis
Probing-Based Method
Probing Knowledge
...and 19 more sections

Figures (4)

Figure 1: Categorization of literature on explainability in LLMs, focusing on techniques (left, Section \ref{sec:exp}) and their applications (right, Section \ref{sec:app}).
Figure 2: Studied role of each Transformer component. (a) gives an overview of the attention mechanism in Transformers. The sizes of the colored circles illustrate the value of the scalar or the norm of the corresponding vector (kobayashi-etal-2020-attention). (b) analyzes the FFN updates in the vocabulary space, showing that each update can be decomposed into sub-updates corresponding to single FFN parameter vectors, each promoting concepts that are often human-interpretable (geva-etal-2022-transformer).
Figure 3: The intensity of each grid cell represents the average causal indirect effect of a hidden state on expressing a factual association. Darker cells indicate stronger causal mediators. The MLPs at the last subject token and the attention modules at the last token were found to play crucial roles (meng2023locating).
Figure 4: (a) Dense Attention (Vaswani2017AttentionIA) has $O(T^2)$ time complexity and a cache that grows with the text length. Its performance decreases when the text length exceeds the pre-training text length. (b) Window Attention caches the keys and values (KV) of the most recent $L$ tokens. While efficient at inference, its performance declines sharply once the keys and values of the starting tokens are evicted. (c) Sliding Window (pope2022efficiently) with Re-computation performs well on long texts, but its $O(TL^2)$ complexity, stemming from quadratic attention in context re-computation, makes it considerably slower. (d) StreamingLLM (xiao2023efficient) keeps the attention sink (several initial tokens) for stable attention computation, combined with the recent tokens. It is efficient and offers stable performance on extended texts.
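
The difference between the cache policies in Figure 4(b) and 4(d) can be illustrated with a minimal sketch of which token positions keep their keys and values. This is not the StreamingLLM implementation: the function name `kept_positions` and its parameters `window` and `num_sinks` are hypothetical, chosen only for illustration, and the sketch tracks positions rather than actual KV tensors.

```python
def kept_positions(num_tokens, window, num_sinks=0):
    """Return the token positions whose keys/values stay in the cache.

    Hypothetical helper for illustration only.
    num_tokens: number of tokens seen so far (T in the caption).
    window:     number of most recent tokens retained (L in the caption).
    num_sinks:  number of initial tokens always retained (attention sinks);
                0 corresponds to plain window attention.
    """
    if num_tokens < 0 or window < 0 or num_sinks < 0:
        raise ValueError("all arguments must be non-negative")
    # Initial tokens that are never evicted.
    sinks = list(range(min(num_sinks, num_tokens)))
    # Most recent tokens, without double-counting the sink positions.
    recent_start = max(num_tokens - window, len(sinks))
    recent = list(range(recent_start, num_tokens))
    return sinks + recent


# Window attention: once T exceeds L, the starting tokens are evicted.
print(kept_positions(10, window=4))               # [6, 7, 8, 9]
# Attention-sink policy: the first tokens survive alongside the recent window.
print(kept_positions(10, window=4, num_sinks=2))  # [0, 1, 6, 7, 8, 9]
```

In both cases the cache size stays bounded (at most `window + num_sinks` entries) regardless of the text length, in contrast to the growing cache of dense attention; the only difference is whether the starting tokens remain available to attend to.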
