
Understanding Post-hoc Explainers: The Case of Anchors

Gianluigi Lopardo, Frederic Precioso, Damien Garreau

TL;DR

This work provides a theoretically grounded examination of Anchors, a local rule-based explainer for text predictions, by formalizing the algorithm under a binary linear classifier with a fixed TF-IDF vectorizer and introducing exhaustive $p$-Anchors as a tractable analysis target. It derives a Gaussian-approximate description of anchor precision via $\overline{\Phi}(L(A))$, where $L(A)$ encodes the interaction between the classifier weights, TF-IDF weights, and anchor composition, with a Berry-Esseen-type bound that scales with the document length $d$. The authors show that, for linear models, Anchors tend to prioritize words with positive influence (aligned with $\lambda_j v_j$) and provide empirical validation on sentiment datasets; the framework aligns anchor selection with theoretical guarantees and aids the development of robust explainability methods. Overall, the paper lays a principled foundation for interpreting post-hoc explanations in text data and offers a pathway to extend the analysis to broader model classes and data modalities.
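
To make the setting concrete, the following is a minimal Python sketch, not the authors' implementation, of how the precision of a candidate anchor could be estimated by simulation for a linear classifier on TF-IDF features. It assumes an unnormalized TF-IDF of the form $\varphi(z)_j = m_j v_j$ (term count times a fixed inverse document frequency), an anchor given as a vector of fixed multiplicities $a_j \le m_j$, and a sampling scheme that keeps each non-anchored word occurrence independently with probability 1/2. All function names and the numbers in the usage note are hypothetical.

```python
import random


def tfidf(counts, idf):
    """Assumed TF-IDF: term count m_j times the fixed IDF weight v_j."""
    return [m * v for m, v in zip(counts, idf)]


def classify(counts, idf, lam, lam0):
    """Binary linear classifier: 1 if lam . phi(z) + lam0 > 0, else 0."""
    score = lam0 + sum(l * x for l, x in zip(lam, tfidf(counts, idf)))
    return 1 if score > 0 else 0


def estimate_precision(counts, anchor, idf, lam, lam0, n_samples=10000, seed=0):
    """Monte Carlo estimate of the precision of an anchor.

    counts[j] is the multiplicity m_j of word j in the document.
    anchor[j] is the number a_j <= m_j of occurrences fixed by the anchor.
    Assumption: every non-anchored occurrence is kept with probability 1/2,
    independently of the others.
    """
    rng = random.Random(seed)
    target = classify(counts, idf, lam, lam0)  # prediction to be explained
    hits = 0
    for _ in range(n_samples):
        sample = [
            a + sum(rng.random() < 0.5 for _ in range(m - a))
            for m, a in zip(counts, anchor)
        ]
        hits += classify(sample, idf, lam, lam0) == target
    return hits / n_samples
```

For instance, with counts `[2, 1]`, IDF weights `[1.0, 1.0]`, `lam = [1.0, -1.0]` and `lam0 = -0.5`, anchoring both occurrences of the first word (`anchor = [2, 0]`) gives an estimated precision of 1.0 under these assumptions, since the score stays positive whether or not the second word is kept.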

Abstract

In many scenarios, the interpretability of machine learning models is a highly required but difficult task. To explain the individual predictions of such models, local model-agnostic approaches have been proposed. However, the process generating the explanations can be, for a user, as mysterious as the prediction to be explained. Furthermore, interpretability methods frequently lack theoretical guarantees, and their behavior on simple models is frequently unknown. While it is difficult, if not impossible, to ensure that an explainer behaves as expected on a cutting-edge model, we can at least ensure that everything works on simple, already interpretable models. In this paper, we present a theoretical analysis of Anchors (Ribeiro et al., 2018): a popular rule-based interpretability method that highlights a small set of words to explain a text classifier's decision. After formalizing its algorithm and providing useful insights, we demonstrate mathematically that Anchors produces meaningful results when used with linear text classifiers on top of a TF-IDF vectorization. We believe that our analysis framework can aid in the development of new explainability methods based on solid theoretical foundations.

Paper Structure (15 sections, 2 theorems, 4 equations, 2 figures)

This paper contains 15 sections, 2 theorems, 4 equations, 2 figures.

Key Result

Proposition 1

Let $\lambda,\lambda_0$ be the coefficients of the linear classifier defined by Eq. (eq:def-linear-classifier). Assume that $\lambda_jv_j\neq 0$ for all $j\in [d]$. Define, for all $A \in\mathcal{A}$, […] Let $\overline{\Phi} := 1-\Phi$, where $\Phi$ denotes the cumulative distribution function of the $\mathcal{N}(0,1)$ distribution. Then, for any $A\in\mathcal{A}$ such that $\left\lvert A\right\rvert$ …

Figures (2)

  • Figure 1: Anchors explaining the positive prediction of a black-box model $f$ on an example $\xi$ from the Restaurant review dataset. The anchor $A = \{\textit{great, not, bad, fine}\}$ (in blue), of length $\left\lvert A\right\rvert = 4$, is selected. Intuitively, the joint presence of these four words in a document ensures a positive prediction by $f$ with high probability ($\texttt{precision} = 0.97$), while not being too uncommon ($\texttt{coverage} = 0.12$).
  • Figure 2: On the left, illustration of Proposition 2. On linear models, the algorithm includes the words with the highest $\lambda_jv_j$ first. The minimal anchor satisfying the precision condition $\overline{\Phi}\left(L\left(A\right)\right) \approx\mathrm{Prec}(A) \geq 1-\varepsilon$ is then selected, which is $A=(m_1,m_2,m_3,2,0,\ldots,0,0)$ in the example. On the right, validation of Proposition 2: average Jaccard similarity between the anchor $A$ and the first $\left\lvert A\right\rvert$ words ranked by $\lambda_jv_j$, for a logistic model on positive documents and on the subsets classified with low confidence ($\texttt{pr} = g(\varphi(\xi)) < 0.85$, or $\texttt{pr} < 0.75$).
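
The comparison described in the caption of Figure 2 can be sketched in a few lines of Python: rank the words of a document by $\lambda_j v_j$, keep the first $\left\lvert A\right\rvert$, and measure their Jaccard similarity with the anchor. This is an illustrative sketch under the assumption that the anchor is represented as a set of word indices; the function names and the toy numbers in the usage note are hypothetical and not taken from the paper's code or experiments.

```python
def top_words_by_influence(lam, idf, k):
    """Indices of the k words with the largest lam_j * v_j."""
    scores = [l * v for l, v in zip(lam, idf)]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| between two sets of word indices."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)


def anchor_agreement(anchor_words, lam, idf):
    """Similarity between an anchor and the top-|A| words ranked by lam_j * v_j."""
    top = top_words_by_influence(lam, idf, len(set(anchor_words)))
    return jaccard(anchor_words, top)
```

With `lam = [0.5, 2.0, -1.0, 1.0]` and `idf = [1.0, 1.0, 2.0, 3.0]`, the products $\lambda_j v_j$ are `[0.5, 2.0, -2.0, 3.0]`, so the top two words are those with indices 1 and 3. An anchor `{1, 3}` then has similarity 1.0 with this ranking, whereas an anchor `{0, 1}` shares one word out of three distinct ones and has similarity 1/3.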

Theorems & Definitions (2)

  • Proposition 1: Precision of a linear classifier
  • Proposition 2: Approximate precision maximization
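
The tail function $\overline{\Phi} = 1-\Phi$ used in Proposition 1, and the precision condition $\overline{\Phi}(L(A)) \geq 1-\varepsilon$ from Figure 2, can be evaluated numerically as in the sketch below. The value of $L(A)$ is taken as an input because its definition is not reproduced in this summary; the function names and the default $\varepsilon = 0.05$ are illustrative assumptions.

```python
import math


def phi_bar(x):
    """Standard normal tail 1 - Phi(x), computed via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def meets_precision_condition(l_value, eps=0.05):
    """Check the approximate precision condition phi_bar(L(A)) >= 1 - eps.

    l_value stands for L(A), which must be supplied by the caller.
    """
    return phi_bar(l_value) >= 1.0 - eps
```

Because $\overline{\Phi}$ is decreasing, the condition is easier to satisfy the more negative $L(A)$ is; with $\varepsilon = 0.05$ it holds roughly when $L(A) \leq -1.645$.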