Interpretability Needs a New Paradigm

Andreas Madsen, Himabindu Lakkaraju, Siva Reddy, Sarath Chandar

TL;DR

This position paper argues that the field should consider new interpretability paradigms while staying vigilant regarding faithfulness. It presents 3 emerging paradigms: models designed so that faithfulness can be easily measured, models optimized so that explanations become faithful, and models that produce both a prediction and an explanation.

Abstract

Interpretability is the study of explaining models in understandable terms to humans. At present, interpretability is divided into two paradigms: the intrinsic paradigm, which believes that only models designed to be explained can be explained, and the post-hoc paradigm, which believes that black-box models can be explained. At the core of this debate is how each paradigm ensures its explanations are faithful, i.e., true to the model's behavior. This is important, as false but convincing explanations lead to unsupported confidence in artificial intelligence (AI), which can be dangerous. This paper's position is that we should think about new paradigms while staying vigilant regarding faithfulness. First, by examining the history of paradigms in science, we see that paradigms are constantly evolving. Then, by examining the current paradigms, we can understand their underlying beliefs, the value they bring, and their limitations. Finally, this paper presents 3 emerging paradigms for interpretability. The first paradigm designs models such that faithfulness can be easily measured. Another optimizes models such that explanations become faithful. The last paradigm proposes to develop models that produce both a prediction and an explanation.

Paper Structure

This paper contains 26 sections, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Abstract diagram of the intrinsic paradigm, where the model is architecturally constrained, such that the constraint itself is the explanation. In the case of Decision Trees, the entire model is constrained, but often (e.g., Prototype Networks or Attention) only part of the model is constrained.
  • Figure 2: Abstract diagram of the post-hoc paradigm, where a post-hoc method is used to explain a black-box model. The post-hoc method is usually an algorithm, like the gradient w.r.t. the input, but it can also be an auxiliary model.
  • Figure 3: Abstract diagram of the learn-to-faithfully explain paradigm. In most cases, this paradigm works by generating an explanation from the input, using either a model or an algorithm. This explanation is then fed into the predictive model, which has been optimized to respect the explanation.
  • Figure 4: Abstract diagram of the faithfulness measurable model paradigm. In this paradigm, the predictive model can also measure how faithful a given explanation is. The explanation can thus be produced by optimizing an initial (maybe random) explanation towards maximal faithfulness.
  • Figure 5: Abstract diagram of the self-explanation paradigm, where the same model is trained to produce both the regular predictive output and an explanation, called a self-explanation. This paradigm is often seen with Large Language Models, where both the predictive output and the self-explanations appear as generated text.
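
To make the post-hoc setup in Figure 2 and the notion of faithfulness more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's method: the "black box" is a hypothetical toy logistic-regression model with made-up weights, the explanation is the absolute gradient w.r.t. the input (approximated by finite differences), and faithfulness is approximated with an erasure test, one common proxy, which checks how much the prediction changes when the features ranked most important are replaced by a baseline value.

```python
import math

# Hypothetical toy "black box": a fixed logistic-regression model.
# The weights and bias are made up for illustration only.
WEIGHTS = [2.0, -1.0, 0.0, 0.5]
BIAS = -0.5


def predict(x):
    """Return the positive-class probability for input vector x."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def gradient_explanation(x, eps=1e-6):
    """Post-hoc importance scores: |d predict / d x_i|.

    Uses central finite differences, so it only needs black-box
    access to predict().
    """
    scores = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        scores.append(abs(predict(hi) - predict(lo)) / (2 * eps))
    return scores


def erasure_drop(x, scores, k, baseline=0.0):
    """Faithfulness proxy (an assumption of this sketch).

    Replaces the k highest-scored features with a baseline value and
    returns the absolute change in the prediction. A faithful
    explanation should yield a larger change than an unfaithful one.
    """
    top = sorted(range(len(x)), key=lambda i: scores[i], reverse=True)[:k]
    masked = [baseline if i in top else xi for i, xi in enumerate(x)]
    return abs(predict(x) - predict(masked))


x = [1.0, 1.0, 1.0, 1.0]
scores = gradient_explanation(x)
ranking = sorted(range(len(x)), key=lambda i: scores[i], reverse=True)

# Compare the explanation against a deliberately inverted ranking.
faithful_drop = erasure_drop(x, scores, k=1)
inverted_drop = erasure_drop(x, [-s for s in scores], k=1)
```

In this toy setting, erasing the top-ranked feature changes the prediction far more than erasing the feature the inverted ranking prefers. For real black-box models, erasing features can push inputs out of distribution, which is one reason measuring faithfulness is harder than this sketch suggests.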