Sequential Integrated Gradients: a simple but effective method for explaining language models

Joseph Enguehard

TL;DR

Sequential Integrated Gradients (SIG) addresses the drift in meaning that occurs when Integrated Gradients (IG) interpolates every word of a sentence at once. It attributes one word at a time while holding the other words fixed, and uses the trained token <mask> as the baseline when the model provides one. The method defines a gradient-integrated attribution $SIG_{ij}$ for each feature $j$ of word $i$, then sums over $j$ and normalizes to obtain a per-word score $SIG_i(x) = \frac{\sum_j SIG_{ij}(x)}{||SIG(x)||}$. It preserves key axioms such as implementation invariance and satisfies a per-word completeness relation. Empirically, SIG outperforms IG, DIG, and other baselines across multiple BERT-family models and sentiment datasets; the mask baseline improves explanations, and SIG is robust to the choice of baseline. SIG is computationally intensive, but SIG with fewer interpolation steps can outperform IG with more steps. The authors also discuss extensions to autoregressive models and the ethical implications of explanation methods.
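The per-word quantities mentioned above can be written out as follows. This is a sketch in our own notation rather than a quotation of the paper's equations: $F$ is the model output being explained, $x_{ij}$ is feature $j$ of word $i$, and $\bar{x}^i$ (an assumed symbol) denotes the sentence $x$ with only word $i$ replaced by the baseline token.

```latex
% Attribution of feature j of word i: integrate the gradient along a
% straight path on which only word i moves from the baseline to its value.
SIG_{ij}(x) = (x_{ij} - \bar{x}^i_{ij})
  \int_0^1 \frac{\partial F\big(\bar{x}^i + \alpha\,(x - \bar{x}^i)\big)}{\partial x_{ij}} \, d\alpha

% Per-word completeness: because x and \bar{x}^i differ only in word i,
% the attributions of word i account for the full change in the output.
\sum_j SIG_{ij}(x) = F(x) - F(\bar{x}^i)
```

The second line follows from the completeness property of IG applied to the path between $\bar{x}^i$ and $x$, on which every word other than $i$ contributes a zero difference.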

Abstract

Several explanation methods such as Integrated Gradients (IG) can be characterised as path-based methods, as they rely on a straight line between the data and an uninformative baseline. However, when applied to language models, these methods produce a path for each word of a sentence simultaneously. The sentences built from the interpolated words may then have no clear meaning, or a meaning significantly different from that of the original sentence. To keep the meaning of these sentences as close as possible to the original, we propose Sequential Integrated Gradients (SIG), which computes the importance of each word in a sentence by keeping every other word fixed and creating interpolations only between the baseline and the word of interest. Moreover, inspired by the training procedure of several language models, we also propose to replace the baseline token "pad" with the trained token "mask". Although SIG is a simple improvement over the original IG method, we show on various models and datasets that it is a very effective method for explaining language models.
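The procedure described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The scorer `f`, its analytic gradient `grad_f`, the weight vector `w`, and the zero vector standing in for the "mask" embedding are all hypothetical choices made for illustration; the integral is approximated with a midpoint Riemann sum, and the normalization uses the Euclidean norm of the per-word sums, which is an assumption about how $||SIG||$ is computed.

```python
import numpy as np

def f(x, w):
    """Toy differentiable scorer (hypothetical); x has shape (n_words, dim)."""
    return np.tanh((x @ w).sum())

def grad_f(x, w):
    """Analytic gradient of f with respect to x, same shape as x."""
    s = (x @ w).sum()
    return (1.0 - np.tanh(s) ** 2) * np.tile(w, (x.shape[0], 1))

def sig(x, mask_vec, w, n_steps=200):
    """Per-feature SIG attributions, shape (n_words, dim)."""
    alphas = (np.arange(n_steps) + 0.5) / n_steps  # midpoint rule on [0, 1]
    attributions = np.zeros_like(x)
    for i in range(x.shape[0]):
        baseline = x.copy()
        baseline[i] = mask_vec       # only word i is replaced by the baseline
        diff = x - baseline          # non-zero only on row i
        # Average the gradient for word i along the straight path.
        avg_grad = np.mean(
            [grad_f(baseline + a * diff, w)[i] for a in alphas], axis=0
        )
        attributions[i] = diff[i] * avg_grad
    return attributions

def word_scores(attributions):
    """Sum over features, then normalize (assumed Euclidean norm)."""
    per_word = attributions.sum(axis=1)
    return per_word / np.linalg.norm(per_word)

# Example call on random "embeddings" for a 4-word sentence.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
w = 0.3 * rng.normal(size=3)
mask_vec = np.zeros(3)               # stand-in for a trained mask embedding
scores = word_scores(sig(x, mask_vec, w))
```

Each word needs its own pass over the interpolation steps, so the cost grows with sentence length; this is the computational overhead relative to IG that the paper discusses.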

Paper Structure (19 sections, 7 equations, 1 figure, 8 tables)

Introduction
Method
SIG formulation
Axioms satisfied by SIG
Using mask instead of pad as a baseline
Experiments
Experiments design
Results
Comparison with other feature attribution methods
Comparison between IG and DIG
Choice of the baseline token
Time complexity of SIG
Comparison of IG and SIG on several examples
Conclusion
On the symmetry-preserving axiom of Sequential Integrated Gradients
...and 4 more sections

Figures (1)

Figure 1: Comparison between IG, DIG, and our method, SIG. While DIG improves on IG by creating discretized paths between the data and the baseline, it can produce sentences whose meaning differs from that of the original. Our method tackles this issue by fixing every word except one to its true value and moving the remaining word along a straight path.
