Debugging Tests for Model Explanations
Julius Adebayo, Michael Muelly, Ilaria Liccardi, Been Kim
TL;DR
This work examines whether post-hoc explanations are effective for debugging machine learning models, framing bugs as data, model, or test-time contamination. It systematically evaluates a range of feature attribution methods across these bug classes and supplements the analysis with a human-subject study. The results show that attributions can detect spurious background signals but struggle with mislabeled training data and model contamination, and that end-users rely more on model predictions than on explanations. The findings highlight that attribution methods which modify back-propagation are invariant to higher-layer parameters, and they caution against over-interpreting visually similar attributions for out-of-domain inputs. The study provides practical guidance on when explanations may be useful for debugging and underscores the need for broader, more rigorous evaluation of explanation-based debugging approaches.
Abstract
We investigate whether post-hoc model explanations are effective for diagnosing model errors, a task we refer to as model debugging. In response to the challenge of explaining a model's prediction, a vast array of explanation methods have been proposed. Despite their increasing use, it is unclear whether they are effective. To start, we categorize bugs, based on their source, into data, model, and test-time contamination bugs. For several explanation methods, we assess their ability to: detect spurious correlation artifacts (data contamination), diagnose mislabeled training examples (data contamination), differentiate between a (partially) re-initialized model and a trained one (model contamination), and detect out-of-distribution inputs (test-time contamination). We find that the methods tested are able to diagnose a spurious background bug, but cannot conclusively identify mislabeled training examples. In addition, a class of methods that modify the back-propagation algorithm are invariant to the higher-layer parameters of a deep network and are hence ineffective for diagnosing model contamination. We complement our analysis with a human subject study and find that subjects fail to identify defective models using attributions, relying instead primarily on model predictions. Taken together, our results provide guidance for practitioners and researchers turning to explanations as tools for model debugging.
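
The model contamination check described above (comparing attributions from a trained model with those from a model whose higher layer has been re-initialized) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the paper's experimental setup: the two-layer ReLU network, the function names, the use of rank correlation as the similarity measure, and `input_only_attribution`, a hypothetical stand-in for a method that ignores higher-layer parameters.

```python
import numpy as np

def gradient_attribution(params, x):
    """Gradient of the scalar output w2 . relu(W1 x) with respect to x."""
    W1, w2 = params
    active = (W1 @ x > 0).astype(x.dtype)  # which ReLU units are on
    return W1.T @ (w2 * active)

def input_only_attribution(params, x):
    """Hypothetical stand-in for a method that ignores the top layer."""
    return np.abs(x)

def rank_correlation(a, b):
    """Spearman rank correlation (assumes no tied values)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

def top_layer_randomization_similarity(attr_fn, params, x, rng):
    """Re-initialize the top layer and compare attributions before and after."""
    W1, w2 = params
    w2_random = rng.standard_normal(w2.shape)  # simulated model contamination
    before = attr_fn((W1, w2), x)
    after = attr_fn((W1, w2_random), x)
    return rank_correlation(before, after)

rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 8))  # toy "trained" lower layer
w2 = rng.standard_normal(16)       # toy "trained" top layer
x = rng.standard_normal(8)
params = (W1, w2)

sim_gradient = top_layer_randomization_similarity(gradient_attribution, params, x, rng)
sim_input_only = top_layer_randomization_similarity(input_only_attribution, params, x, rng)
print(f"gradient similarity after re-initialization:   {sim_gradient:.3f}")
print(f"input-only similarity after re-initialization: {sim_input_only:.3f}")
```

In this sketch, a similarity that stays near 1 after re-initialization indicates that the attribution carries no information about the top layer, so it could not reveal this kind of bug. The stand-in stays at 1 by construction, while the gradient attribution depends on the top-layer weights and generally changes.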
