Robust Infidelity: When Faithfulness Measures on Masked Language Models Are Misleading

Evan Crothers, Herna Viktor, Nathalie Japkowicz

TL;DR

This paper questions the practice of evaluating neural text classifier interpretability via faithfulness measures derived from iterative token masking. It shows that such measures are highly sensitive to model initialization and task/dataset, and that masked samples frequently lie outside the training data distribution, leading to unpredictable behavior and misleading cross-model comparisons. The authors connect iterative masking to adversarial-like attacks, and find that adversarial training does not yield consistent gains in fidelity across models or tasks. They propose practical guidance for evaluating interpretability, emphasizing cautious use of masking-based fidelity, accounting for dataset characteristics, and considering the robustness implications of token-level reliance. The work highlights fundamental limits of faithfulness as a universal interpretability metric for transformer-based text classifiers and suggests directions for more principled evaluation.

Abstract

A common approach to quantifying neural text classifier interpretability is to calculate faithfulness metrics based on iteratively masking salient input tokens and measuring changes in the model prediction. We propose that this property is better described as "sensitivity to iterative masking", and highlight pitfalls in using this measure for comparing text classifier interpretability. We show that iterative masking produces large variation in faithfulness scores between otherwise comparable Transformer encoder text classifiers. We then demonstrate that iteratively masked samples produce embeddings outside the distribution seen during training, resulting in unpredictable behaviour. We further explore task-specific considerations that undermine principled comparison of interpretability using iterative masking, such as an underlying similarity to salience-based adversarial attacks. Our findings give insight into how these behaviours affect neural text classifiers, and provide guidance on how sensitivity to iterative masking should be interpreted.
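The iterative masking procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `iterative_masking` function, the `[MASK]` string, the keyword-counting `toy_predict` classifier, and the importance scores are all hypothetical stand-ins for a real Transformer classifier and a real salience method.

```python
from typing import Callable, List, Optional, Sequence

MASK = "[MASK]"  # assumed mask token string; real tokenizers define their own


def iterative_masking(
    tokens: Sequence[str],
    importance: Sequence[float],
    predict: Callable[[List[str]], int],
) -> Optional[int]:
    """Mask tokens in descending importance until the predicted label changes.

    Returns the number of tokens masked when the label first flips,
    or None if the label never changes, even with every token masked.
    """
    original_label = predict(list(tokens))
    # Visit token positions from most to least important.
    order = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    masked = list(tokens)
    for step, idx in enumerate(order, start=1):
        masked[idx] = MASK
        if predict(masked) != original_label:
            return step
    return None


# Hypothetical toy classifier: label 1 if positive words outnumber negative ones.
POSITIVE = {"great", "charming"}
NEGATIVE = {"dull", "boring"}


def toy_predict(tokens: List[str]) -> int:
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return 1 if score > 0 else 0


tokens = ["a", "great", "and", "charming", "film"]
importance = [0.0, 0.9, 0.0, 0.8, 0.1]  # made-up salience scores
flips_after = iterative_masking(tokens, importance, toy_predict)

# A sample whose label survives full masking, so no flip point exists.
never_flips = iterative_masking(["dull", "film"], [0.9, 0.1], toy_predict)
```

In this toy setup the first sample changes label only after both positive words are masked, while the second never changes label because the fully masked input still maps to the original class. The second case shows why a masking-based score depends on how the classifier treats heavily masked inputs, and not only on the quality of the importance ranking.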

Paper Structure

This paper contains 15 sections, 4 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Iterative token removal in descending order of feature importance on a sample from SST-2. Despite identifying the most important tokens, the classification is unchanged during either iterative masking or iterative deletion.
  • Figure 2: UMAP projections of sample embeddings at varying levels of masking. Masking more tokens moves the resulting embeddings further outside the domain of the original dataset. Masking a couple of tokens within a dataset with a longer average sequence length has a relatively minor effect (e.g., see the Wikipedia Toxic Comments examples), but longer samples still generally require a significant portion of tokens to be masked to change classification (see the table of fidelity ratios).
  • Figure 3: Comparison of centroid cosine similarity and mean standard deviation of embedding vectors between BERT and RoBERTa across TCAB datasets.
  • Figure 4: Word-level adversarial attack vs. iterative masking on an AGNews sample with a BERT classifier. Both the adversarial attack and iterative masking perturb the prediction after manipulating a single token.
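
The Figure 3 caption names two embedding statistics without defining them here. One plausible reading, offered only as an assumption, is the cosine similarity between the centroids (mean vectors) of two sets of embeddings, and the per-dimension standard deviation averaged over dimensions. The sketch below computes both on made-up two-dimensional vectors; the function names and the data are hypothetical, and the paper's exact definitions may differ.

```python
import math
from statistics import pstdev
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Mean vector of a set of equal-length embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_std(vectors: Sequence[Vector]) -> float:
    """Population standard deviation per dimension, averaged over dimensions."""
    stds = [pstdev(column) for column in zip(*vectors)]
    return sum(stds) / len(stds)


# Made-up embeddings standing in for unmasked and heavily masked samples.
original = [[1.0, 0.0], [1.0, 0.2]]
masked = [[0.0, 1.0], [0.2, 1.0]]

similarity = cosine_similarity(centroid(original), centroid(masked))
spread = mean_std(original)
```

Under this reading, a low centroid similarity between unmasked and masked embeddings would indicate that masking has moved samples away from the region the classifier saw during training.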