Global-to-Local Support Spectrums for Language Model Explainability

Lucas Agussurja; Xinyang Lu; Bryan Kian Hsiang Low

TL;DR

The paper tackles explainability by attributing model outputs to training data, with a focus on local, test-point-specific explanations. It introduces global-to-local support spectrums, built from support sets in feature space and a locality-constrained optimization that combines a global importance $g$ and a local relevance $\ell$ to form spectrums $S(z_t; k)$. The framework accommodates both representer-point-based and influence-function-based decompositions, yielding general and relative spectrums that show how well a test point is supported and how it is distinguished from other classes. Experiments on MNIST and on GPT2-XL/Open-LLaMA-7B show that the spectrums provide interpretable, test-point-level explanations, enable source attribution in generated text, and help detect biases and spurious correlations, with potential use in data debugging and model auditing.

Abstract

Existing sample-based methods, such as influence functions and representer points, measure the importance of a training point by approximating the effect of its removal from training. As such, they are skewed towards outliers and points that are very close to the decision boundaries. The explanations provided by these methods are often static and not specific enough for different test points. In this paper, we propose a method to generate an explanation in the form of support spectrums, which are based on two main ideas: the support sets and a global-to-local importance measure. The support set is the set of training points in the predicted class that "lie in between" the test point and the training points in the other classes. It indicates how well the test point can be distinguished from the points not in the predicted class. The global-to-local importance measure is obtained by decoupling existing methods into global and local components, which are then used to select the points in the support set. Using this method, we are able to generate explanations that are tailored to specific test points. In the experiments, we show the effectiveness of the method in image classification and text generation tasks.
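
To make the notion of a support set concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linear final layer with one weight row per class (no bias term), and it reads "lie in between" as "has a discriminant score no larger than that of the test point". That reading is an assumption based on the abstract and the Figure 1 caption, where the general support uses the normal $\hat{W}_b$ and the relative support uses $\hat{W}_b - \hat{W}_r$. The function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def predicted_class(weights: Sequence[Vector], feature: Vector) -> int:
    """Index of the class whose linear discriminant scores highest."""
    scores = [dot(w, feature) for w in weights]
    return max(range(len(scores)), key=scores.__getitem__)


def general_support(features, labels, weights, test_feature) -> List[int]:
    """Assumed reading: predicted-class training points whose score along
    W_c is at most the test point's score along W_c."""
    c = predicted_class(weights, test_feature)
    threshold = dot(weights[c], test_feature)
    return [
        i
        for i, (f, y) in enumerate(zip(features, labels))
        if y == c and dot(weights[c], f) <= threshold
    ]


def relative_support(features, labels, weights, test_feature, other) -> List[int]:
    """Assumed reading: same test as above, but along W_c - W_other,
    the normal of the decision boundary between the two classes."""
    c = predicted_class(weights, test_feature)
    normal = [a - b for a, b in zip(weights[c], weights[other])]
    threshold = dot(normal, test_feature)
    return [
        i
        for i, (f, y) in enumerate(zip(features, labels))
        if y == c and dot(normal, f) <= threshold
    ]
```

Both functions return indices into the training set. The sketch covers only support-set membership; the global-to-local importance measure that selects points within the support set is not modeled here.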

Paper Structure (9 sections, 13 equations, 6 figures)

Introduction
Related Works
Support Spectrums
Experiments
MNIST
GPT2-XL and Open-LLaMA-7B
Conclusions
Additional MNIST Examples
Additional Llama2-7B Examples

Figures (6)

Figure 1: An illustrative example showing the training points (red, green, and blue dots) and a test point (black dot) in the learned feature space. (a) The blue line is the discriminant of the class blue whose normal is given by $\hat{W}_b$. The darker blue dots with dark edges are the general support set of the test point. (b) The solid black line is the decision boundary between blue and red, whose normal is $\hat{W}_b - \hat{W}_r$. The darker blue dots are the support set relative to red.
Figure 2: (a) The training points of a 3-class classification problem together with the regions and decision boundaries given by a trained network. (b)-(c) The general spectrums for two different test points (the black dots). Opaque dots indicate points that are in the support set, and transparent dots indicate those that are not. (d)-(e) Importance values given by the excitatory representer points. Darker colors indicate higher values. (f)-(g) Importance values given by the influence function. Similarly, darker colors indicate higher values.
Figure 3: Spectrums for a test point (classified as blue) relative to (a) the red class and (b) the green class. The opaque blue dots are the relative supports.
Figure 4: The spectrums for two test points (with the same predicted class but different styles) from MNIST.
Figure 5: An example from GPT2-XL. The first row shows parts of the same generated text (the underlined parts are the prompt). Each column shows the top 3 training sequences (together with their source articles) that are closest to the generated text in its general spectrum; the two columns use the tokens RNA and molecules, respectively, as the output.
...and 1 more figure
