New Faithfulness-Centric Interpretability Paradigms for Natural Language Processing

Andreas Madsen

TL;DR

Faithfulness measurable models (FMMs) yield explanations that are near the theoretical optimum in terms of faithfulness. This shows that even simple modifications to the model, such as randomly masking the training dataset, can drastically change the situation and result in consistently faithful explanations.

Abstract

As machine learning becomes more widespread and is used in more critical applications, it is important to provide explanations for these models in order to prevent unintended behavior. Unfortunately, many current interpretability methods struggle with faithfulness. Therefore, this Ph.D. thesis investigates the question "How to provide and ensure faithful explanations for complex general-purpose neural NLP models?" The main thesis is that we should develop new paradigms in interpretability. This is achieved by first developing solid faithfulness metrics and then applying the lessons learned from this investigation to develop new paradigms. The two new paradigms explored are faithfulness measurable models (FMMs) and self-explanations. The idea in self-explanations is to have large language models explain themselves. We find that current models are not capable of doing this consistently; however, we suggest how this could be achieved. The idea of FMMs is to create models that are designed such that measuring faithfulness is cheap and precise. This makes it possible to optimize an explanation towards maximum faithfulness, so FMMs are in effect designed to be explained. We find that FMMs yield explanations that are near the theoretical optimum in terms of faithfulness. Overall, across all investigations of faithfulness, the results show that the faithfulness of post-hoc and intrinsic explanations is by default model- and task-dependent. However, this was not the case when using FMMs, even with the same post-hoc explanation methods. This shows that even simple modifications to the model, such as randomly masking the training dataset, as was done in FMMs, can drastically change the situation and result in consistently faithful explanations. This answers the question of how to provide and ensure faithful explanations.
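To make the two ingredients named in the abstract more concrete (randomly masking the training data, and measuring faithfulness cheaply), here is a minimal Python sketch. It is an illustration under stated assumptions, not the thesis's implementation: the `[MASK]` token, the uniformly sampled masking ratio, the function names, and the toy model are all hypothetical, and the erasure-style faithfulness curve is one common way to measure faithfulness rather than necessarily the exact metric used in the thesis.

```python
import random

MASK = "[MASK]"  # assumption: a generic mask token, not taken from the thesis


def random_mask(tokens, rng):
    """Mask a uniformly sampled fraction of the tokens (assumed masking scheme)."""
    ratio = rng.random()  # masking ratio in [0, 1)
    n_mask = round(ratio * len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def mask_top_k(tokens, importance, k):
    """Mask the k tokens that an explanation ranks as most important."""
    ranked = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    top = set(ranked[:k])
    return [MASK if i in top else tok for i, tok in enumerate(tokens)]


def faithfulness_curve(model, tokens, importance):
    """Model output as progressively more 'important' tokens are masked.

    If the explanation is faithful, the output should drop quickly.
    """
    return [model(mask_top_k(tokens, importance, k)) for k in range(len(tokens) + 1)]


def toy_model(tokens):
    """Hypothetical stand-in for a classifier: 1.0 if 'good' is visible."""
    return 1.0 if "good" in tokens else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = ["the", "movie", "was", "good"]
    # Training-time augmentation: the model also sees randomly masked inputs.
    print(random_mask(sentence, rng))
    # Evaluation: mask tokens in order of claimed importance and watch the output.
    print(faithfulness_curve(toy_model, sentence, [0.0, 0.1, 0.0, 0.9]))
```

The intuition the sketch tries to convey is that a model which has already seen masked inputs during training can be evaluated on masked inputs without leaving its training distribution, so an evaluation like `faithfulness_curve` needs only forward passes and no retraining.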

Paper Structure

This paper contains 249 sections, 31 equations, 120 figures, 28 tables, and 3 algorithms.

Figures (120)

  • Figure 1: Hypothetical visualization of HotFlip. The highlight indicates the gradient w.r.t. the input, which HotFlip uses to select which token to change. $\mathbf{x}$ indicates the original sentence, and $\tilde{\mathbf{x}}$ indicates the adversarial sentence.
  • Figure 1: Shows the cumulative importance score relative to the total importance score for the top-k tokens. The metric is averaged over 5 seeds with a 95% confidence interval. Note that the datasets differ in sequence length, so the scores are hard to compare across datasets.
  • Figure 1: The all aggregation for the 100% masked performance and unmasked performance. The baseline (dashed line) for 100% masked performance is the class-majority baseline. Unmasked performance is when using no masking for both validation and training.
  • Figure 1: Counterfactual explanation and interpretability-faithfulness evaluation, with the configuration "Persona instruction: objective, Counterfactual target: explicit". The true label is "negative". The initial prediction was "correct". The interpretability-faithfulness was evaluated to be "faithful".
  • Figure 2: Hypothetical results of using SEA (Ribeiro et al., 2018). Note that unlike HotFlip, SEA can change and delete multiple tokens simultaneously, as it samples from a paraphrasing model. Again, $\mathbf{x}$ indicates the original sentence, $\tilde{\mathbf{x}}$ indicates the adversarial sentence, and $S(\mathbf{x}, \tilde{\mathbf{x}})$ is the semantic-equivalence score, which must be at least $0.8$.
  • ...and 115 more figures