Debugging Tests for Model Explanations

Julius Adebayo, Michael Muelly, Ilaria Liccardi, Been Kim

TL;DR

This work interrogates the effectiveness of post-hoc explanations for debugging machine learning models by framing bugs as data, model, or test-time contamination. It systematically evaluates a range of feature attribution methods across these bug classes and supplements the analysis with a human-subject study, revealing that attributions can detect spurious background signals but struggle with mislabeled data or model contamination, and that end-users rely more on predictions than explanations. The findings highlight invariances in many attribution methods to higher-layer parameters and caution about over-interpreting visually similar attributions for out-of-domain inputs. The study provides practical guidance on when explanations may be useful for debugging and underscores the need for broader, more rigorous evaluation of explanation-based debugging approaches.

Abstract

We investigate whether post-hoc model explanations are effective for diagnosing model errors, a task we refer to as model debugging. In response to the challenge of explaining a model's prediction, a vast array of explanation methods has been proposed. Despite increasing use, it is unclear if they are effective. To start, we categorize bugs, based on their source, into data, model, and test-time contamination bugs. For several explanation methods, we assess their ability to: detect spurious correlation artifacts (data contamination), diagnose mislabeled training examples (data contamination), differentiate between a (partially) re-initialized model and a trained one (model contamination), and detect out-of-distribution inputs (test-time contamination). We find that the methods tested are able to diagnose a spurious background bug, but cannot conclusively identify mislabeled training examples. In addition, a class of methods that modify the back-propagation algorithm is invariant to the higher-layer parameters of a deep network and hence ineffective for diagnosing model contamination. We complement our analysis with a human subject study and find that subjects fail to identify defective models using attributions, but instead rely primarily on model predictions. Taken together, our results provide guidance for practitioners and researchers turning to explanations as tools for model debugging.
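The model contamination test described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the two-layer ReLU network, the random stand-in for "trained" weights, and the use of Spearman rank correlation as the similarity measure are all assumptions made for illustration. The idea is to compute a plain gradient attribution for the same input under the original top layer and under a re-initialized one, then compare the two. An attribution method that is sensitive to the higher-layer parameters should yield a similarity well below 1.

```python
import random


def relu_mlp_gradient(x, W1, w2):
    """Gradient of f(x) = w2 . relu(W1 x) with respect to the input x."""
    hidden = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    active = [1.0 if h > 0 else 0.0 for h in hidden]  # ReLU derivative
    return [
        sum(w2[j] * active[j] * W1[j][i] for j in range(len(W1)))
        for i in range(len(x))
    ]


def rank(values):
    """Rank positions (0 = smallest); ties are broken by index order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks


def spearman(a, b):
    """Spearman rank correlation using the no-ties formula."""
    ra, rb = rank(a), rank(b)
    n = len(a)
    d2 = sum((p - q) ** 2 for p, q in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical toy setup: random weights stand in for a trained model.
rng = random.Random(0)
n_in, n_hidden = 8, 16
x = [rng.uniform(-1, 1) for _ in range(n_in)]
W1 = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)]
w2_trained = [rng.gauss(0, 1) for _ in range(n_hidden)]  # "trained" top layer
w2_reinit = [rng.gauss(0, 1) for _ in range(n_hidden)]   # re-initialized top layer

attr_trained = relu_mlp_gradient(x, W1, w2_trained)
attr_reinit = relu_mlp_gradient(x, W1, w2_reinit)
similarity = spearman(attr_trained, attr_reinit)
print(f"rank correlation, trained vs. re-initialized: {similarity:.3f}")
```

In practice the same comparison would be run over many inputs and with each attribution method under study; a method whose attributions stay nearly unchanged after re-initialization cannot be used to detect this kind of bug.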

Paper Structure

This paper contains 40 sections, 1 equation, 64 figures, 4 tables.

Figures (64)

  • Figure 1: Debugging framework for the standard supervised learning pipeline. Schematic of the standard supervised learning pipeline along with examples of bugs that can occur at each stage of the pipeline. The categorization captures defects that can occur with the training data, model, and at test-time. We term these: data, model, and test-time contamination tests.
  • Figure 2: Attribution Methods Considered. The figure shows feature attributions for two inputs for a CNN model trained to distinguish between birds and dogs.
  • Figure 3: Feature Attributions for Spurious Correlation Bugs. The figure shows attributions for 4 inputs for the BVD-CNN trained on spurious data. A & B show two dog examples, and C & D are bird examples. The first row shows the input (dog or bird) on a spurious background. The second row shows the attributions of only the spurious background. Notably, we observe that the feature attribution methods place emphasis on the background. See Table \ref{tab:spuriousmetrics} for metrics.
  • Figure 4: Ground Truth Attribution for Spurious Correlation.
  • Figure 5: A: Participant Responses from User Study. Box plot of participants' responses for 3 attribution methods (Gradient, SmoothGrad, and Integrated Gradients) and the 5 model conditions tested. The vertical axis is a Likert scale from 1 (Definitely Not) to 5 (Definitely). Participants were instructed to select 'Definitely' if they deemed the dog-breed classification model ready to be sold to customers. B: Motivation for Selection. Participants' selected motivations (%) for the recommendation made. As shown in the legend, users could select one of 4 options or insert an open-ended response.
  • ...and 59 more figures