Interpretable Counterfactual Explanations Guided by Prototypes

Arnaud Van Looveren, Janis Klaise

TL;DR

This work introduces a prototype-guided, model-agnostic framework for fast, interpretable counterfactual explanations. A prototype loss term, built on class prototypes obtained from either an encoder or class-specific k-d trees, steers perturbations toward interpretable counterfactuals and removes the numerical gradient evaluation bottleneck for black-box models. The work also proposes two instance-level interpretability metrics and reports substantial speedups and improved local interpretability on MNIST and Breast Cancer Wisconsin (Diagnostic), with an extension to categorical variables via ABDM/MVDM embeddings. The method is released in the open-source library alibi.

Abstract

We propose a fast, model-agnostic method for finding interpretable counterfactual explanations of classifier predictions by using class prototypes. We show that class prototypes, obtained using either an encoder or class-specific k-d trees, significantly speed up the search for counterfactual instances and result in more interpretable explanations. We introduce two novel metrics to quantitatively evaluate local interpretability at the instance level. We use these metrics to illustrate the effectiveness of our method on an image dataset and a tabular dataset, respectively MNIST and Breast Cancer Wisconsin (Diagnostic). The method also eliminates the computational bottleneck that arises from numerical gradient evaluation for $\textit{black box}$ models.
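The idea of a class prototype that guides the search can be illustrated with a minimal sketch. It rests on assumptions not spelled out in this summary: the prototype of a class is taken to be the k-th nearest training point of that class to the original instance, a brute-force sort stands in for the class-specific k-d tree query, and the penalty is a squared L2 distance scaled by a weight `theta`. The function names `class_prototypes`, `nearest_prototype` and `prototype_penalty` are hypothetical and are not the alibi API.

```python
import math
from collections import defaultdict


def class_prototypes(x0, X, y, predicted_class, k=1):
    """Return one prototype per class other than predicted_class.

    Assumption: the prototype is the k-th nearest training point of the
    class to x0. A k-d tree would answer the same query faster; a plain
    sort is used here to keep the sketch dependency-free.
    """
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)

    protos = {}
    for label, points in by_class.items():
        if label == predicted_class:
            continue  # the counterfactual must leave the original class
        ranked = sorted(points, key=lambda p: math.dist(p, x0))
        protos[label] = ranked[min(k, len(ranked)) - 1]
    return protos


def nearest_prototype(x0, protos):
    """Pick the class whose prototype lies closest to x0."""
    label = min(protos, key=lambda c: math.dist(protos[c], x0))
    return label, protos[label]


def prototype_penalty(x_cf, proto, theta=1.0):
    """Assumed penalty: theta times the squared L2 distance to the prototype."""
    return theta * sum((a - b) ** 2 for a, b in zip(x_cf, proto))


# Toy data: three classes in two dimensions (illustrative values only).
X = [(0.0, 0.0), (0.2, 0.1), (2.0, 2.0), (2.5, 2.0), (-3.0, 0.0)]
y = [0, 0, 1, 1, 2]
x0 = (0.1, 0.0)  # instance currently assigned to class 0

protos = class_prototypes(x0, X, y, predicted_class=0)
target, proto = nearest_prototype(x0, protos)
```

In this toy setup the penalty shrinks as a candidate counterfactual moves from `x0` toward `proto`, which is the sense in which a prototype term can steer the perturbation without querying the classifier's gradient.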

Paper Structure

This paper contains 31 sections, 18 equations, 13 figures, 4 tables, 2 algorithms.

Figures (13)

  • Figure 1: (a) Examples of original and counterfactual instances on the MNIST dataset along with predictions of a CNN model. (b) A counterfactual instance on the Adult (Census) dataset highlighting the feature changes required to alter the prediction of an NN model.
  • Figure 2: First row: (a) original instance and (b) uninterpretable counterfactual $3$. (c), (d) and (e) are reconstructions of (b) with respectively $\text{AE}_{3}$, $\text{AE}_{5}$ and $\text{AE}$. Second row: (a) original instance and (b) interpretable counterfactual $6$. (c), (d) and (e) are reconstructions of (b) with respectively $\text{AE}_{6}$, $\text{AE}_{5}$ and $\text{AE}$.
  • Figure 3: (a) Mean time in seconds and number of gradient updates needed to find a satisfactory counterfactual for objective functions $A$ to $F$ across all MNIST classes. The error bars represent the standard deviation to illustrate variability between approaches. (b) Mean IM1 and IM2 for objective functions $A$ to $F$ across all MNIST classes (lower is better). The error bars represent the $95$% confidence bounds. (c) Sparsity measure $\text{EN}(\delta)$ for loss functions $A$ to $F$. The error bars represent the $95$% confidence bounds.
  • Figure 4: (a) shows the original instance. On the first row, (b) to (g) show counterfactuals generated using loss functions $A$ to $F$. On the second row, (b) to (g) show the same counterfactuals reconstructed using $\text{AE}$.
  • Figure 5: Left: Embedding of the categorical variable "Education" in numerical space using association based distance metric (ABDM). Right: Frequency based embedding.
  • ...and 8 more figures