A Question on the Explainability of Large Language Models and the Word-Level Univariate First-Order Plausibility Assumption

Jeremie Bogaert; Francois-Xavier Standaert

TL;DR

This paper gives statistical definitions of the signal, noise and signal-to-noise ratio of explanations. It then discusses whether alternative definitions of signal and noise, which would capture more complex explanations and analysis methods, could improve the results obtained with these definitions, while also questioning the tradeoff with their plausibility for readers.

Abstract

The explanations of large language models have recently been shown to be sensitive to the randomness used for their training, creating a need to characterize this sensitivity. In this paper, we propose a characterization that questions whether simple and informative explanations can be provided for such models. To this end, we give statistical definitions of the explanations' signal, noise and signal-to-noise ratio. We highlight that, in a typical case study where word-level univariate explanations are analyzed with first-order statistical tools, the explanations of simple feature-based models carry more signal and less noise than those of transformer models. We then discuss whether these results could be improved with alternative definitions of signal and noise that would capture more complex explanations and analysis methods, while also questioning the tradeoff with their plausibility for readers.
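To make the notions of signal, noise and signal-to-noise ratio more concrete, the following is a minimal sketch of one possible first-order, word-level reading. The definitions used here are assumptions made for illustration and are not taken from the paper: the signal is taken as the variance, across words, of the mean attribution, and the noise as the average, across words, of the variance over equivalent models. The function name `explanation_snr` and the example scores are hypothetical.

```python
import numpy as np

def explanation_snr(attributions):
    """Estimate signal, noise and SNR from word-level attributions.

    attributions: array-like of shape (n_models, n_words), one row per
    equivalent model (same architecture, different training randomness).
    The definitions below are illustrative assumptions, not the paper's.
    """
    a = np.asarray(attributions, dtype=float)
    # Assumed signal: how much the mean attribution differs between words.
    signal = a.mean(axis=0).var()
    # Assumed noise: how much each word's attribution varies across
    # models, averaged over the words of the text.
    noise = a.var(axis=0).mean()
    # A model with no variability across runs has zero noise.
    snr = signal / noise if noise > 0 else float("inf")
    return signal, noise, snr

# Hypothetical scores: 3 equivalent models, 4 words.
scores = np.array([
    [0.9, 0.1, 0.0, 0.5],
    [0.7, 0.3, 0.0, 0.5],
    [0.8, 0.2, 0.0, 0.5],
])
signal, noise, snr = explanation_snr(scores)
print(f"signal={signal:.4f} noise={noise:.4f} snr={snr:.2f}")
```

Under these assumed definitions, a deterministic model whose explanations do not vary across runs has zero noise, so its ratio is reported as infinite. Richer (multivariate or higher-order) definitions would require a different estimator.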

Paper Structure (9 sections, 3 equations, 5 figures)

Introduction
Background
Case study
CamemBERT model
Feature-based model
Explanations
Experimental setting
Experimental results
Conclusion

Figures (5)

Figure 1: Generation of equivalent models and compatible inputs.
Figure 2: Example of linguistic attention map (left) and LRP attention map (middle) with translation (right).
Figure 3: Explanations' box-plot for a transformer model. The green and orange dashes respectively show the mean and the median of the attention distribution for each word. Words with a mean attention value far from (resp., close to) 0 are generally attributed more (resp., less) attention in the explanations. Tight box-plots (e.g., "ouvert") show a low variability in the attention given by equivalent models. On the contrary, wide box-plots (e.g., "ce") show a high variability.
Figure 4: Explanations' box-plot for a feature-based model. As there is no variability in the attention attributed by the feature-based model, each box-plot is a simple line. Words that are not used in any feature have an attention score of 0.
Figure 5: Metrics estimation for illustrative texts.
