Faithful and Robust Local Interpretability for Textual Predictions

Gianluigi Lopardo; Frederic Precioso; Damien Garreau

TL;DR

This work tackles faithful, robust local interpretability for text predictions by introducing FRED, a perturbation-based explainer that yields three outputs: a minimal influential word subset, per-token importance scores, and counterfactual examples. It formalizes the explanation framework via a drop-in-prediction paradigm, using $d(x)=\mathbb{E}[f(x)]-f(x)$ and $\Delta_c=\mathbb{E}[f(x)]-\mathbb{E}[f(x)\mid c\notin x]$, and optimizes to minimize explanation length under a significance constraint. The authors prove theoretical properties for interpretable classifiers (notably linear models and shortcut detectors) and validate FRED empirically against state-of-the-art explainers across multiple datasets and models, showing improved faithfulness and robustness, especially on longer documents and modern architectures. The approach offers practical, theoretically grounded insights for understanding textual predictions and provides a publicly available implementation to support reproducibility and application in real-world settings.

Abstract

Interpretability is essential for machine learning models to be trusted and deployed in critical domains. However, existing methods for interpreting text models are often complex, lack mathematical foundations, and their performance is not guaranteed. In this paper, we propose FRED (Faithful and Robust Explainer for textual Documents), a novel method for interpreting predictions over text. FRED offers three key insights to explain a model prediction: (1) it identifies the minimal set of words in a document whose removal has the strongest influence on the prediction, (2) it assigns an importance score to each token, reflecting its influence on the model's output, and (3) it provides counterfactual explanations by generating examples similar to the original document, but leading to a different prediction. We establish the reliability of FRED through formal definitions and theoretical analyses on interpretable classifiers. Additionally, our empirical evaluation against state-of-the-art methods demonstrates the effectiveness of FRED in providing insights into text models.

Paper Structure (31 sections, 4 theorems, 24 equations, 3 figures, 35 tables, 1 algorithm)

Introduction
Organization of the paper.
Related work
FRED
Setting and Notation
Drop in prediction
Empirical drop in prediction.
Sampling scheme
Remark.
Explanations
Analysis on Explainable Classifiers
Linear classifiers
Shortcuts detection
Experiments
Faithfulness.
...and 16 more sections

Key Result

Lemma 1

For a candidate explanation $c$, let $n_c$ denote the number of samples, among the $n$ perturbed samples, that do not contain $c$. Then, as $n \to \infty$, the empirical drop in prediction $\widehat{\Delta}_c$ associated with the candidate $c$ converges in probability to

Figures (3)

Figure 1: FRED explaining the prediction of a sentiment analysis model for the restaurant review "poor drinks, decent food, great service", classified as "positive". The average confidence over the sample is $0.556$. (a) FRED identifies the minimal subset of tokens that, if removed, makes the prediction drop by a specified threshold $\varepsilon$ ($=0.5$). (b) Saliency map of token importance scores: dark green (resp., red) means high positive (resp., negative) influence. (c) Samples close to the example, but classified as "negative". Perturbations with respect to the example are in orange.
Figure 2: Illustration of FRED's pos-sampling scheme (left panel) and mask-sampling scheme (right panel) for computing the drop of a candidate. For a given example $\xi$, FRED generates $n$ perturbed samples $x_1, \ldots, x_n$ by independently perturbing tokens with probability $p$ ($=0.5$). Each sample is associated with the model's drop in prediction $d(x_j)$. Finally, the empirical drop $\widehat{\Delta}_{c}$ of a candidate is computed by averaging the drops over the samples that do not contain $c$. In the example, the candidate consists of the words "decent" and "great". The samples where both tokens are perturbed are highlighted in gray. The empirical drop associated with {decent, great} is therefore computed by averaging $d(x_3)$, $d(x_5)$, $\ldots$, $d(x_n)$.
Figure 3: Illustration of Proposition 4. On linear models, Algorithm 1 includes the words with the highest $\lambda_j v_j$ first. Finally, the minimal candidate satisfying the threshold condition is selected, which is $c=(m_1,m_2,m_3,2,0,\ldots,0,0)$ in the example.
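
The estimate described in the caption of Figure 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `perturb`, `empirical_drop`, `toy_model`, and the `[MASK]` placeholder are hypothetical, perturbation is simplified to replacing a token with a fixed placeholder, and $\mathbb{E}[f(x)]$ is approximated by the mean prediction over the sample. The drop of each sample follows the TL;DR definition $d(x)=\mathbb{E}[f(x)]-f(x)$, and a candidate is represented by the set of token positions it covers.

```python
import random
from statistics import mean

MASK = "[MASK]"  # hypothetical placeholder; the actual replacement rule is an assumption


def perturb(tokens, p, rng):
    """Independently replace each token with MASK with probability p."""
    return [MASK if rng.random() < p else tok for tok in tokens]


def empirical_drop(tokens, candidate, model, n=2000, p=0.5, seed=0):
    """Average drop d(x_j) over the samples in which every candidate position is perturbed.

    `candidate` is a set of token positions; `model` maps a token list to a confidence.
    Returns None if no sample has all candidate positions perturbed.
    """
    rng = random.Random(seed)
    samples = [perturb(tokens, p, rng) for _ in range(n)]
    preds = [model(sample) for sample in samples]
    avg_pred = mean(preds)  # sample estimate of E[f(x)]
    drops = [
        avg_pred - pred
        for sample, pred in zip(samples, preds)
        if all(sample[i] == MASK for i in candidate)
    ]
    return mean(drops) if drops else None


def toy_model(sample):
    """Toy additive sentiment scorer, used only to exercise the sketch."""
    score = 0.5
    score += 0.2 * ("great" in sample)
    score += 0.1 * ("decent" in sample)
    score -= 0.2 * ("poor" in sample)
    return score


tokens = ["poor", "drinks", "decent", "food", "great", "service"]
drop_great = empirical_drop(tokens, {4}, toy_model)         # removing "great"
drop_poor = empirical_drop(tokens, {0}, toy_model)          # removing "poor"
drop_both = empirical_drop(tokens, {2, 4}, toy_model)       # removing "decent" and "great"
```

With this toy scorer, removing "great" lowers the average confidence (positive drop), removing "poor" raises it (negative drop), and removing both "decent" and "great" gives a larger drop than removing "great" alone. A candidate would then qualify as an explanation when its empirical drop reaches the threshold $\varepsilon$.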

Theorems & Definitions (5)

Lemma 1: Convergence of Empirical Drop $\widehat{\Delta}_c$
Lemma 2: Choosing $n$
Definition 3: TF-IDF
Proposition 4: Linear models
Proposition 5: Presence of shortcuts
