Sequential Integrated Gradients: a simple but effective method for explaining language models
Joseph Enguehard
TL;DR
Sequential Integrated Gradients (SIG) addresses the drift in meaning that occurs when path-based explanation methods interpolate every word of a sentence at once. It attributes one word at a time while holding the other words fixed, and uses the trained token <mask> as the baseline when possible. The method formalizes per-word gradient-integrated attributions $SIG_{ij}$ and normalizes their sum into a word-level score, $SIG_i(x) = \frac{\sum_j SIG_{ij}}{||SIG||}$, while preserving key axioms such as implementation invariance and a per-word form of completeness. Empirically, SIG outperforms IG, DIG, and other baselines across multiple BERT-family models and sentiment datasets; the mask baseline improves explanations, and SIG proves robust to the choice of baseline. Although SIG is computationally intensive, SIG with fewer interpolation steps can outperform IG with more steps. The authors also discuss extensions to autoregressive models and the ethical implications of explanation methods.
Abstract
Several explanation methods such as Integrated Gradients (IG) can be characterised as path-based methods, as they rely on a straight line between the data and an uninformative baseline. However, when applied to language models, these methods produce a path for each word of a sentence simultaneously, which can create sentences from interpolated words that either have no clear meaning or have a significantly different meaning from the original sentence. In order to keep the meaning of these sentences as close as possible to the original one, we propose Sequential Integrated Gradients (SIG), which computes the importance of each word in a sentence by keeping every other word fixed, only creating interpolations between the baseline and the word of interest. Moreover, inspired by the training procedure of several language models, we also propose to replace the baseline token "pad" with the trained token "mask". While being a simple improvement over the original IG method, we show on various models and datasets that SIG proves to be a very effective method for explaining language models.
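To make the procedure described above concrete, the following is a minimal sketch on a toy model, not the authors' implementation. Everything in it is an assumption chosen for illustration: the scalar `model` (a tanh of summed projections), the two-dimensional embeddings, the `mask` vector standing in for a trained mask-token embedding, the midpoint Riemann approximation of the integral, and the use of the L2 norm of the per-word sums as the normaliser.

```python
import math

def model(x, w):
    """Toy scalar model (assumption): tanh of the summed projections of all word embeddings."""
    return math.tanh(sum(wk * xk for word in x for wk, xk in zip(w, word)))

def grad_embedding(x, w):
    """Analytic gradient of the toy model w.r.t. one word embedding.

    The toy model is symmetric in its words, so the gradient is the same for every word.
    """
    out = model(x, w)
    return [(1.0 - out * out) * wk for wk in w]

def sequential_ig(x, mask, w, n_steps=50):
    """Return (raw per-word sums, normalised per-word attributions)."""
    raw = []
    for i, word in enumerate(x):
        diff = [xk - mk for xk, mk in zip(word, mask)]
        integral = [0.0] * len(word)
        for step in range(n_steps):
            alpha = (step + 0.5) / n_steps  # midpoint Riemann rule
            # Only word i is interpolated; every other word keeps its original embedding.
            point = [list(v) for v in x]
            point[i] = [mk + alpha * dk for mk, dk in zip(mask, diff)]
            grad = grad_embedding(point, w)
            integral = [acc + gk / n_steps for acc, gk in zip(integral, grad)]
        # Sum over embedding features j of (x_ij - mask_j) * integrated gradient.
        raw.append(sum(dk * ik for dk, ik in zip(diff, integral)))
    # Normaliser (assumption): L2 norm of the vector of per-word sums.
    norm = math.sqrt(sum(r * r for r in raw)) or 1.0
    return raw, [r / norm for r in raw]

# Hypothetical three-word "sentence" with 2-dimensional embeddings.
x = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
mask = [0.0, 0.1]   # stand-in for the trained mask-token embedding
w = [0.7, -0.4]     # toy model weights
raw, sig = sequential_ig(x, mask, w)
```

Because only one word moves along each path, the unnormalised sum for word i should approximately equal the difference between the model output on the original sentence and its output on the sentence with word i replaced by the mask embedding. This gives a quick sanity check for any implementation along these lines.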
