Faithfulness Measurable Masked Language Models

Andreas Madsen; Siva Reddy; Sarath Chandar

TL;DR

This paper addresses the problem that token-importance explanations for NLP models can be misleading by introducing inherently faithfulness measurable models (FMMs). A simple masked fine-tuning strategy makes token masking in-distribution, which enables erasure-based faithfulness evaluation without retraining. The authors validate that masking is in-distribution using MaSF and assess faithfulness with RACU/ACU across 16 datasets. They find that occlusion-based explanations are the most faithful under in-distribution masking, and that signed variants often improve faithfulness further. Because the framework is specific to a model instance, faithfulness becomes cheap to measure and potentially optimizable, which helps align explanations with the actual reasoning of deployed models.

Abstract

A common approach to explaining NLP models is to use importance measures that express which tokens are important for a prediction. Unfortunately, such explanations are often wrong despite being persuasive. Therefore, it is essential to measure their faithfulness. One such metric rests on the idea that if tokens are truly important, then masking them should result in worse model performance. However, token masking introduces out-of-distribution issues, and existing solutions that address this are computationally expensive and employ proxy models. Furthermore, other metrics are very limited in scope. This work proposes an inherently faithfulness measurable model that addresses these challenges. This is achieved using a novel fine-tuning method that incorporates masking, such that masking tokens becomes in-distribution by design. This differs from existing approaches, which are completely model-agnostic but are inapplicable in practice. We demonstrate the generality of our approach by applying it to 16 different datasets and validate it using statistical in-distribution tests. The faithfulness is then measured with 9 different importance measures. Because masking is in-distribution, importance measures that themselves use masking become consistently more faithful. Additionally, because the model makes faithfulness cheap to measure, we can optimize explanations towards maximal faithfulness; thus, our model becomes indirectly inherently explainable.
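
To make the idea of fine-tuning with masking concrete, the following is a minimal sketch of how a training mini-batch might be prepared. It is not the authors' implementation. The abstract only states that masking is incorporated into fine-tuning, so the half-unmasked, half-masked split, the uniformly drawn masking rate, and the names `MASK_TOKEN`, `mask_tokens`, and `masked_batch` are all assumptions made for illustration.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical mask symbol; the real token depends on the tokenizer


def mask_tokens(tokens, rate, rng):
    """Replace round(rate * len(tokens)) randomly chosen tokens with MASK_TOKEN."""
    n_mask = round(rate * len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK_TOKEN if i in positions else t for i, t in enumerate(tokens)]


def masked_batch(batch, rng):
    """Prepare one mini-batch for fine-tuning with masking.

    Assumption: the first half of the batch stays unmasked, and each example
    in the second half is masked at a rate drawn uniformly from [0, 1].
    """
    half = len(batch) // 2
    out = [list(tokens) for tokens in batch[:half]]
    for tokens in batch[half:]:
        out.append(mask_tokens(tokens, rng.uniform(0.0, 1.0), rng))
    return out
```

Training on such batches exposes the model to inputs at every masking level, from untouched to fully masked, which is what would make masked inputs in-distribution at evaluation time.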

Paper Structure (66 sections, 3 equations, 25 figures, 12 tables, 3 algorithms)

Introduction
To summarize, our contributions are:
Related Work
Correlating importance measures
Known explanations in synthetic tasks
Similar inputs, similar explanation
Removing important information should affect the prediction
Inherently faithfulness measurable models (FMMs)
Faithfulness of importance measures
Masked fine-tuning
In-distribution validation
Faithfulness metric
Importance measures (IMs)
Gradient-based vs occlusion-based
Gradient-based
...and 51 more sections

Figures (25)

Figure 1: To measure faithfulness, a faithfulness measurable masked language model is created (a), then the model is checked for out-of-distribution issues given an explanation (b), and finally, the faithfulness is measured by masking allegedly important tokens (c). Here, [$\mathcal{M}$] is the masking token.
Figure 2: The unmasked performance for each fine-tuning strategy. Plain fine-tuning is the baseline (dashed line). We find that our Masked fine-tuning does not decrease performance. All is computed by taking the average of all datasets. More datasets and a more detailed ablation study can be found in the appendix on masked fine-tuning.
Figure 3: The 100% masked performance for each fine-tuning strategy. The dashed line represents the class-majority baseline. Results show that masking during training (either our masked fine-tuning or only masking) is necessary. More datasets and a more detailed ablation study can be found in the appendix on masked fine-tuning.
Figure 4: In-distribution p-values using MaSF, for RoBERTa-base with and without masked fine-tuning. The masked tokens are chosen according to an importance measure. P-values below the dashed line indicate out-of-distribution (OOD) results, given a 5% risk of a false positive. Results show that only when using masked fine-tuning is masking consistently not OOD. Because the results are highly consistent, the overlapping lines do not hide any important details. More datasets and models are in the appendix on out-of-distribution validation.
Figure 5: The performance given the masked datasets, where masking is done for the x% allegedly most important tokens according to the importance measure. If the performance for a given explanation is below the "Random" baseline, this shows faithfulness. Faithfulness is not an absolute concept, however, so more is better. This plot is for RoBERTa-base and separates importance measures based on their signed and absolute variants. More datasets and models are in the appendix on faithfulness.
...and 20 more figures
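
The erasure procedure described in the Figure 5 caption can be sketched as follows: mask the fraction of tokens that an importance measure ranks highest, then record model performance on the masked inputs. This is only an illustrative sketch. The toy classifier `toy_predict`, the toy importance function `toy_importance`, and the helper names are hypothetical, and the sketch does not compute the paper's RACU/ACU summary scores.

```python
MASK_TOKEN = "[MASK]"  # hypothetical mask symbol


def mask_top_fraction(tokens, scores, fraction):
    """Mask the round(fraction * len(tokens)) tokens with the highest scores."""
    n_mask = round(fraction * len(tokens))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:n_mask])
    return [MASK_TOKEN if i in chosen else t for i, t in enumerate(tokens)]


def accuracy_under_masking(predict, examples, fraction, score_fn):
    """Accuracy after masking the allegedly most important tokens of each example."""
    correct = 0
    for tokens, label in examples:
        masked = mask_top_fraction(tokens, score_fn(tokens), fraction)
        correct += int(predict(masked) == label)
    return correct / len(examples)


# Toy stand-ins (assumptions for illustration only).
def toy_predict(tokens):
    return "pos" if "good" in tokens else "neg"


def toy_importance(tokens):
    return [1.0 if t == "good" else 0.0 for t in tokens]


EXAMPLES = [(["a", "good", "film"], "pos"), (["a", "dull", "film"], "neg")]
```

Calling `accuracy_under_masking` with the same `fraction` but randomly drawn scores gives the "Random" baseline curve; an importance measure is judged faithful to the extent that its curve falls below that baseline.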
