Interpretability Guarantees with Merlin-Arthur Classifiers

Stephan Wäldchen, Kartikey Sharma, Berkant Turan, Max Zimmer, Sebastian Pokutta

TL;DR

The paper proposes an interactive multi-agent classifier that provides provable interpretability guarantees even for complex agents such as neural networks. The guarantees rely on the relative strength of the agents and on the new concept of Asymmetric Feature Correlation, which captures the precise kind of correlations that make interpretability guarantees difficult.

Abstract

We propose an interactive multi-agent classifier that provides provable interpretability guarantees even for complex agents such as neural networks. These guarantees consist of lower bounds on the mutual information between selected features and the classification decision. Our results are inspired by the Merlin-Arthur protocol from Interactive Proof Systems and express these bounds in terms of measurable metrics such as soundness and completeness. Compared to existing interactive setups, we rely neither on optimal agents nor on the assumption that features are distributed independently. Instead, we use the relative strength of the agents as well as the new concept of Asymmetric Feature Correlation which captures the precise kind of correlations that make interpretability guarantees difficult. We evaluate our results on two small-scale datasets where high mutual information can be verified explicitly.

Paper Structure

This paper contains 48 sections, 11 theorems, 95 equations, 17 figures, 3 tables, 2 algorithms.

Key Result

Lemma 2.5

Given $\mathfrak{D}=(D,\mathcal{D}, c)$, $M\in \mathcal{M}(D)$ and $\delta \in [0,1]$, let $\mathbf{x}, \mathbf{y}\sim\mathcal{D}$. Then with probability $1- \delta^{-1}\left( 1-\mathrm{Pr}_{\mathcal{D}}(M) \right)$, $M(\mathbf{y})$ is a feature s.t.

Figures (17)

  • Figure 1: The Merlin-Arthur classifier consists of two interactive agents that communicate over an exchanged feature. This feature serves as an interpretation of the classification.
  • Figure 2: Illustration of "cheating" behaviour. In the original dataset, the features "sea" and "sky" appear equally often in both classes, "boat" and "island". In the partial images created by Merlin, the "sea" feature appears only in "boat" images and the "sky" feature only in "island" images. Thus, these features now strongly indicate the class of the image. This allows Merlin to communicate the correct class with uninformative features, which contradicts our concept of an interpretable classifier.
  • Figure 3: Strategy evolution with Morgana. a) Due to the "cheating" strategy from Figure 2, Arthur expects the "sea" feature for boats and the "sky" feature for islands. Morgana can exploit this and send the "sky" feature to trick Arthur into classifying a "boat" image as an "island" (and vice versa with "sea"). b) To avoid being fooled into the wrong class when presented with an ambiguous feature, Arthur refrains from giving a concrete classification. c) Since Arthur does not know who sends the features, he can no longer leverage the uninformative features sent by Merlin. d) Merlin adapts his strategy to send only unambiguous features that Morgana cannot use to fool Arthur.
  • Figure 4: Example of a dataset with an AFC of $\kappa=6$. The "fruit" features are concentrated in one image for class $l=-1$ but spread out over six images for $l=1$ (and vice versa for the "fish" features). No individual feature is indicative of the class, as each appears exactly once in each class. Nevertheless, Arthur and Merlin can exchange "fruits" to indicate "$l=1$" and "fish" to indicate "$l=-1$". The images where this strategy fails or can be exploited by Morgana are the two images on the left. Applying the min-max theorem, we get $\epsilon_M = \frac{1}{7}$, and the set $D^{\prime}$ corresponds to all images with a single feature. Restricted to $D^{\prime}$, the features determine the class completely.
  • Figure 5:
  • ...and 12 more figures
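
The counting argument in the Figure 4 caption can be checked on a toy reconstruction of that dataset. The following is a minimal sketch, not the authors' code: the feature names, the set-based image encoding, and the helpers `class_counts`, `arthur` and `strategy_failure_rate` are hypothetical. The formal definition of $\epsilon_M$ is not part of this summary, so the sketch only counts the images on which the "fruit means $l=1$, fish means $l=-1$" convention fails, which happens to agree with the caption's value of $\frac{1}{7}$.

```python
from fractions import Fraction

KAPPA = 6  # asymmetric feature correlation of the toy dataset in Figure 4

# Hypothetical feature names; an image is (set of features, class label).
fruits = [f"fruit{i}" for i in range(KAPPA)]
fish = [f"fish{i}" for i in range(KAPPA)]

dataset = (
    [(frozenset(fruits), -1)]                 # all fruits in one image, class -1
    + [(frozenset([f]), 1) for f in fruits]   # one fruit per image, class 1
    + [(frozenset(fish), 1)]                  # all fish in one image, class 1
    + [(frozenset([f]), -1) for f in fish]    # one fish per image, class -1
)


def class_counts(feature, data):
    """Count in how many images of each class `feature` occurs."""
    counts = {-1: 0, 1: 0}
    for feats, label in data:
        if feature in feats:
            counts[label] += 1
    return counts


def arthur(feature):
    """Assumed convention: a fruit signals class 1, a fish signals class -1."""
    return 1 if feature.startswith("fruit") else -1


def strategy_failure_rate(data):
    """Fraction of images in which every available feature misleads Arthur."""
    fails = sum(
        1 for feats, label in data if all(arthur(f) != label for f in feats)
    )
    return Fraction(fails, len(data))


# On the full dataset, every feature occurs exactly once in each class.
uninformative = all(
    class_counts(f, dataset) == {-1: 1, 1: 1} for f in fruits + fish
)

# D' as described in the caption: all images carrying a single feature.
d_prime = [(feats, label) for feats, label in dataset if len(feats) == 1]

# On D', every feature occurs in one class only.
decisive_on_d_prime = all(
    min(class_counts(f, d_prime).values()) == 0 for f in fruits + fish
)

failure_rate = strategy_failure_rate(dataset)
print(uninformative, failure_rate, len(d_prime), decisive_on_d_prime)
```

Exact fractions are used instead of floats so that the failure rate can be compared directly with the value quoted in the caption. The two failing images are the multi-feature ones, which are also the images Morgana could draw a misleading feature from.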

Theorems & Definitions (28)

  • Definition 2.1
  • Definition 2.2: Two-class Data Space
  • Definition 2.3: Feature Selector
  • Definition 2.4: Feature Classifier
  • Definition 2.5: Average Precision
  • Lemma 2.5
  • Theorem 2.6
  • Definition 2.7: Asymmetric feature correlation
  • Lemma 2.7
  • Definition 2.8: Relative Success Rate
  • ...and 18 more