Interpretability Guarantees with Merlin-Arthur Classifiers

Stephan Wäldchen; Kartikey Sharma; Berkant Turan; Max Zimmer; Sebastian Pokutta

TL;DR

The paper proposes an interactive multi-agent classifier that provides provable interpretability guarantees even for complex agents such as neural networks. The guarantees rely on the relative strength of the agents and on the new concept of Asymmetric Feature Correlation, which captures the precise kind of correlations that make interpretability guarantees difficult.

Abstract

We propose an interactive multi-agent classifier that provides provable interpretability guarantees even for complex agents such as neural networks. These guarantees consist of lower bounds on the mutual information between selected features and the classification decision. Our results are inspired by the Merlin-Arthur protocol from Interactive Proof Systems and express these bounds in terms of measurable metrics such as soundness and completeness. Compared to existing interactive setups, we rely neither on optimal agents nor on the assumption that features are distributed independently. Instead, we use the relative strength of the agents as well as the new concept of Asymmetric Feature Correlation which captures the precise kind of correlations that make interpretability guarantees difficult. We evaluate our results on two small-scale datasets where high mutual information can be verified explicitly.

Paper Structure (48 sections, 11 theorems, 95 equations, 17 figures, 3 tables, 2 algorithms)

Introduction
Related Work
Contribution
Theoretical Framework
Mutual Information, Entropy and Precision
Merlin-Arthur Classification
Asymmetric Feature Correlation
Realistic Algorithms and Relative Success Rate
Finitely Sampled and Biased Dataset
Numerical Implementation
Preventing Manipulation
Evaluation of Theoretical Bounds
Discussion and Limitations
Conclusion
Acknowledgement
...and 33 more sections

Key Result

Lemma 2.5

Let $\mathfrak{D}=(D,\mathcal{D}, c)$, $M\in \mathcal{M}(D)$ and $\delta \in [0,1]$ be given, and let $\mathbf{x}, \mathbf{y}\sim\mathcal{D}$. Then with probability $1- \delta^{-1}\left( 1-\mathrm{Pr}_{\mathcal{D}}(M) \right)$, $M(\mathbf{y})$ is a feature s.t.
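The probability expression in the lemma can be evaluated numerically to see how it trades off against $\delta$. The following is a minimal sketch, not taken from the paper: it assumes that $\mathrm{Pr}_{\mathcal{D}}(M)$ is a number in $[0,1]$ (read here as the average precision of $M$), restricts $\delta$ to $(0,1]$ so that $\delta^{-1}$ is defined, and clips negative values to zero because a negative probability bound carries no information. The function name `lemma_probability_bound` and the sample values are hypothetical.

```python
def lemma_probability_bound(avg_precision: float, delta: float) -> float:
    """Evaluate 1 - (1 - Pr_D(M)) / delta, clipped below at 0.

    avg_precision stands in for Pr_D(M); reading it as the average
    precision of the feature selector M is an assumption of this sketch.
    """
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1] for 1/delta to be defined")
    if not 0.0 <= avg_precision <= 1.0:
        raise ValueError("avg_precision must lie in [0, 1]")
    return max(0.0, 1.0 - (1.0 - avg_precision) / delta)


if __name__ == "__main__":
    # Hypothetical average precision of 0.99, several choices of delta.
    for delta in (0.01, 0.05, 0.1, 0.5):
        bound = lemma_probability_bound(0.99, delta)
        print(f"delta={delta:<5} probability expression={bound:.3f}")
```

Under these assumptions the expression is only informative when $\delta$ exceeds $1-\mathrm{Pr}_{\mathcal{D}}(M)$; for smaller $\delta$ it drops to zero or below.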

Figures (17)

Figure 1: The Merlin-Arthur classifier consists of two interactive agents that communicate over an exchanged feature. This feature serves as an interpretation of the classification.
Figure 2: Illustration of "cheating" behaviour. In the original dataset, the features "sea" and "sky" appear equally in both classes "boat" and "island". In the partial images created by Merlin, the "sea" feature appears only in "boat" images and the "sky" feature only in "island" images. Thus, these features now strongly indicate the class of the image. This allows Merlin to communicate the correct class with uninformative features, in contrast to our concept of an interpretable classifier.
Figure 3: Strategy evolution with Morgana. a) Due to the "cheating" strategy from Figure 2, Arthur expects the "sea" feature for boats and the "sky" feature for islands. Morgana can exploit this and send the "sky" feature to trick Arthur into classifying a "boat" image as an "island" (and vice versa with "sea"). b) To avoid being fooled into the wrong class when presented with an ambiguous feature, Arthur refrains from giving a concrete classification. c) Since Arthur does not know who sends the features, he can no longer leverage the uninformative features sent by Merlin. d) Merlin adapts his strategy to send only unambiguous features that cannot be used by Morgana to fool Arthur.
Figure 4: Example of a dataset with an AFC of $\kappa=6$. The "fruit" features are concentrated in one image for class $l=-1$ but spread out over six images for $l=1$ (vice versa for the "fish" features). Each individual feature is not indicative of the class, as it appears exactly once in each class. Nevertheless, Arthur and Merlin can exchange "fruits" to indicate "$l=1$" and "fish" to indicate "$l=-1$". The images where this strategy fails or can be exploited by Morgana are the two images on the left. Applying the min-max theorem, we get $\epsilon_M = \frac{1}{7}$, and the set $D^{\prime}$ corresponds to all images with a single feature. Restricted to $D^{\prime}$, the features determine the class completely.
Figure 5:
...and 12 more figures
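The dataset described in the Figure 4 caption is small enough to rebuild and check by hand. The sketch below is an illustration under stated assumptions, not the authors' code: images are modelled as sets of feature names, the names `fruit_0`, `fish_0`, and so on are hypothetical, and the failure rate is computed as the fraction of images on which the "fruit means $l=1$, fish means $l=-1$" strategy cannot succeed. Whether this fraction coincides with the paper's formal definition of $\epsilon_M$ is assumed here from the value quoted in the caption.

```python
from fractions import Fraction

K = 6  # six fruit features and six fish features, as in the caption
fruits = [f"fruit_{i}" for i in range(K)]  # hypothetical feature names
fish = [f"fish_{i}" for i in range(K)]

# Each image is (set of features, class label).
dataset = (
    [(frozenset(fruits), -1)]                  # all fruits in one image of class -1
    + [(frozenset([f]), 1) for f in fruits]    # fruits spread over six images of class 1
    + [(frozenset(fish), 1)]                   # all fish in one image of class 1
    + [(frozenset([f]), -1) for f in fish]     # fish spread over six images of class -1
)


def class_count(feature: str, label: int) -> int:
    """Number of images of the given class that contain the feature."""
    return sum(1 for feats, l in dataset if feature in feats and l == label)


def arthur(feature: str) -> int:
    """Arthur's rule from the caption: fruit indicates 1, fish indicates -1."""
    return 1 if feature.startswith("fruit") else -1


def merlin_can_succeed(feats: frozenset, label: int) -> bool:
    """Merlin succeeds if some feature of the image makes Arthur answer correctly."""
    return any(arthur(f) == label for f in feats)


# Every individual feature appears exactly once in each class.
balanced = all(class_count(f, 1) == 1 and class_count(f, -1) == 1 for f in fruits + fish)

# Images on which the strategy fails: the two six-feature images.
failures = [(feats, l) for feats, l in dataset if not merlin_can_succeed(feats, l)]
error_rate = Fraction(len(failures), len(dataset))

# D': images with a single feature; there, each feature fixes the class.
d_prime = [(feats, l) for feats, l in dataset if len(feats) == 1]
determined = all(
    len({l for feats, l in d_prime if f in feats}) == 1 for f in fruits + fish
)

if __name__ == "__main__":
    print("features balanced across classes:", balanced)
    print("failing images:", len(failures), "error rate:", error_rate)
    print("|D'| =", len(d_prime), "class determined on D':", determined)
```

On the full dataset no single feature carries class information, yet the strategy fails only on the two six-feature images, which is the asymmetry the AFC is meant to quantify.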

Theorems & Definitions (28)

Definition 2.1
Definition 2.2: Two-class Data Space
Definition 2.3: Feature Selector
Definition 2.4: Feature Classifier
Definition 2.5: Average Precision
Lemma 2.5
Theorem 2.6
Definition 2.7: Asymmetric feature correlation
Lemma 2.7
Definition 2.8: Relative Success Rate
...and 18 more
