Explaining Pre-Trained Language Models with Attribution Scores: An Analysis in Low-Resource Settings

Wei Zhou; Heike Adel; Hendrik Schuff; Ngoc Thang Vu

TL;DR

This work analyzes attribution scores extracted from prompt-based models with respect to plausibility and faithfulness, and introduces training size as an additional dimension of the analysis. It finds that, in low-resource settings, the prompting paradigm yields more plausible explanations than fine-tuning the models.

Abstract

Attribution scores indicate the importance of different input parts and can, thus, explain model behaviour. Currently, prompt-based models are gaining popularity, among other reasons due to their easier adaptability in low-resource settings. However, the quality of attribution scores extracted from prompt-based models has not been investigated yet. In this work, we address this topic by analyzing attribution scores extracted from prompt-based models with respect to plausibility and faithfulness and comparing them with attribution scores extracted from fine-tuned models and large language models. In contrast to previous work, we introduce training size as another dimension into the analysis. We find that using the prompting paradigm (with either encoder-based or decoder-based models) yields more plausible explanations than fine-tuning the models in low-resource settings, and that Shapley Value Sampling consistently outperforms attention and Integrated Gradients, leading to more plausible and faithful explanations.
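To make the notion of an attribution score concrete, the following is a minimal sketch of Shapley Value Sampling over input tokens. It is not the authors' implementation: the function names, the `[MASK]` baseline token, the sample count, and the toy additive scoring function are all assumptions chosen for illustration. In practice `model_fn` would wrap a real classifier and return the score of the predicted label.

```python
import random


def shapley_value_sampling(tokens, model_fn, baseline="[MASK]", n_samples=200, seed=0):
    """Monte Carlo estimate of one Shapley value per token.

    For each sampled permutation, tokens are revealed one at a time
    (starting from an all-baseline input), and the change in the model
    score is credited to the token that was just revealed.
    """
    rng = random.Random(seed)
    n = len(tokens)
    totals = [0.0] * n
    for _ in range(n_samples):
        order = list(range(n))
        rng.shuffle(order)
        current = [baseline] * n          # assumed baseline: every token masked
        prev_score = model_fn(current)
        for i in order:
            current[i] = tokens[i]        # reveal token i
            new_score = model_fn(current)
            totals[i] += new_score - prev_score
            prev_score = new_score
    return [t / n_samples for t in totals]


def toy_model(tokens):
    """Hypothetical stand-in for a sentiment score (additive over tokens)."""
    weights = {"great": 2.0, "bad": -2.0}
    return sum(weights.get(t, 0.0) for t in tokens)


tokens = ["the", "movie", "was", "great"]
scores = shapley_value_sampling(tokens, toy_model)
for token, score in zip(tokens, scores):
    print(f"{token:>6}: {score:+.2f}")
```

Because the toy scoring function is additive, all of the attribution falls on the single sentiment-bearing token. With a real model, token interactions make the estimate depend on the sampled permutations, so more samples generally give more stable scores.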

Paper Structure (27 sections, 5 figures, 4 tables)

Introduction
Extraction of Attribution Scores
Extraction from PBMs.
Extraction from FTMs.
Extraction from LLMs.
Experimental Setup
Tasks and data sets.
Base models.
Prompting methods.
Training details.
Evaluation metrics.
Results
Comparing PBMs and FTMs
Plausibility.
Plausibility error analysis.
...and 12 more sections

Figures (5)

Figure 1: Extraction of explanatory signals from PBMs. Yellow boxes: actual task input. Blue boxes: trigger tokens. Pink box: prediction token. Orange boxes: last hidden representations of PBM. Green box: predicted label (converted by verbalizer, e.g., positive $\rightarrow$ great, negative $\rightarrow$ bad).
Figure 2: Plausibility (the higher the better) and faithfulness (the lower the better) scores for the different prompting methods and fine-tuning, and for the different explanation methods, averaged across base models and seeds. The faithfulness results are shown as the difference between the faithfulness score of the respective explanation method and that of the gold standard. attn: attention, ig: Integrated Gradients, shap: ShapSample.
Figure 3: The $F_{1}$ scores of models trained with different training sizes. From top to bottom: TSE and e-SNLI.
Figure 4: The plausibility scores of explanatory signals, averaged across base models, training sizes and prompting methods. attn stands for attention, ig for Integrated Gradients, and shap for Shapley Value Sampling. gold stands for the gold annotations. NS stands for no significant difference. *, **, *** stand for p-values < .05, .01 and .001, respectively.
Figure 5: The plausibility scores of base models, averaged across saliency methods, training sizes and prompting methods. attn stands for attention, ig for Integrated Gradients, and shap for Shapley Value Sampling. gold stands for the gold annotations. NS stands for no significant difference. *, **, *** stand for p-values < .05, .01 and .001, respectively.
