Understanding Interpretability by generalized distillation in Supervised Classification

Adit Agarwal, K. K. Shukla, Arjan Kuijper, Anirban Mukhopadhyay

TL;DR

The paper proposes an information-theoretic interpretation-by-distillation framework that quantifies interpretability as the information gained when a known model emulates a black-box model's decision boundary, decoupled from ground-truth accuracy. It defines model entropy and complexity for Piece-Wise Linear Neural Networks (PWLNs), derives theoretical entropy and interpretability bounds via graph-coloring concepts, and presents an empirical pathway to compute interpretability using a global surrogate. The authors validate the framework on MNIST, Fashion-MNIST, and Stanford40, demonstrating that interpretability can be measured globally, is not tightly tied to accuracy, and increases with the number of queries, approaching complete interpretation. They also discuss limitations of the current bounds and outline future work: extending the framework to other architectures, incorporating boundary geometry, and studying the effect of dataset complexity, with the aim of tighter, broadly applicable interpretability guarantees.

Abstract

The ability to interpret decisions taken by Machine Learning (ML) models is fundamental to encourage trust and reliability in different practical applications. Recent interpretation strategies focus on human understanding of the underlying decision mechanisms of the complex ML models. However, these strategies are restricted by the subjective biases of humans. To dissociate from such human biases, we propose an interpretation-by-distillation formulation that is defined relative to other ML models. We generalize the distillation technique for quantifying interpretability, using an information-theoretic perspective, removing the role of ground-truth from the definition of interpretability. Our work defines the entropy of supervised classification models, providing bounds on the entropy of Piece-Wise Linear Neural Networks (PWLNs), along with the first theoretical bounds on the interpretability of PWLNs. We evaluate our proposed framework on the MNIST, Fashion-MNIST and Stanford40 datasets and demonstrate the applicability of the proposed theoretical framework in different supervised classification scenarios.
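
The query-and-emulate idea in the abstract can be illustrated with a small toy sketch. This is not the paper's method or its interpretability measure: the black-box rule `black_box`, the 1-nearest-neighbour surrogate, and the use of plain label agreement as a stand-in for information gain are all assumptions made here for illustration. It only shows the setting in which a known model A queries a black-box model B and is scored against B's outputs rather than against ground truth.

```python
import random

def black_box(x, y):
    """Hypothetical black-box B: a piecewise-linear decision rule on the unit square."""
    return 1 if y > abs(x - 0.5) + 0.25 else 0

def fit_surrogate(queries):
    """Known model A (assumed here): a 1-nearest-neighbour rule over B's answers."""
    labelled = [(x, y, black_box(x, y)) for x, y in queries]

    def predict(x, y):
        nearest = min(labelled, key=lambda p: (p[0] - x) ** 2 + (p[1] - y) ** 2)
        return nearest[2]

    return predict

def agreement(predict, probes):
    """Fraction of probe points where A reproduces B's label; ground truth is never used."""
    return sum(predict(x, y) == black_box(x, y) for x, y in probes) / len(probes)

rng = random.Random(0)
probes = [(rng.random(), rng.random()) for _ in range(500)]  # points used to compare A and B
pool = [(rng.random(), rng.random()) for _ in range(400)]    # points A is allowed to query

# Agreement between A and B as A is given more queries to B.
scores = {n: agreement(fit_surrogate(pool[:n]), probes) for n in (5, 50, 400)}
for n, s in scores.items():
    print(f"{n:4d} queries -> agreement with B: {s:.3f}")
```

Agreement is only a rough proxy; the paper defines interpretability through entropy and information gain, which this sketch does not compute.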

Paper Structure

This paper contains 24 sections, 7 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Interpretation as a communication mechanism between known model A and black-box model B, where A performs a series of (possibly infinite) queries to B, until it emulates B's decision boundary and no more information gain is possible.
  • Figure 2: Empirical Interpretation (see text for details on color coding of steps 1-5)
  • Figure 3: Left: Empirical Interpretability when an InceptionV3 network trained on the Stanford40 dataset is interpreted by another InceptionV3 network trained on different cropped versions of the same set of images. Right: An example from the Stanford40 dataset labelled as "Holding An Umbrella", showing the original image (top right), Original Cropped Image (bottom left), Cropped Top Left Image (top left) and Cropped Bottom Right Image (bottom right).
  • Figure 4: Evaluation of Empirical Interpretability on MNIST and Fashion MNIST. Different ensembles (A) are used to interpret a MiniVGGNet (B).
  • Figure 5: Effect of the number of samples on Empirical Interpretability, when a 1-layer PWLN-R (a), a Decision Tree (b) and an SVM (c) are used to interpret a 4-layer PWLN-R.
  • ...and 4 more figures

Theorems & Definitions (5)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5