Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations

Andrew Slavin Ross, Michael C. Hughes, Finale Doshi-Velez

TL;DR

The paper addresses robustness and trust in neural networks under train-test distribution shift by using input gradients as explanations. It introduces a differentiable loss term that penalizes input gradients in regions annotated as irrelevant, so that the model's explanations agree with domain knowledge, and a find-another-explanation method that, when annotations are unavailable, uncovers multiple qualitatively different decision rules for the same task. Experiments on toy and real datasets indicate that gradient explanations are faithful and closely agree with LIME, that explanation regularization improves generalization when training and test conditions differ, and that the iterative find-another-explanation procedure yields diverse models for human inspection, which can help reveal confounds. Overall, the work presents explanation-driven regularization as a scalable route to models that generalize better for the right reasons, with practical uses in model auditing and human-in-the-loop refinement.

Abstract

Neural networks are among the most accurate supervised learning methods in use today, but their opacity makes them difficult to trust in critical applications, especially when conditions in training differ from those in test. Recent work on explanations for black-box models has produced tools (e.g. LIME) to show the implicit rules behind predictions, which can help us identify when models are right for the wrong reasons. However, these methods do not scale to explaining entire datasets and cannot correct the problems they reveal. We introduce a method for efficiently explaining and regularizing differentiable models by examining and selectively penalizing their input gradients, which provide a normal to the decision boundary. We apply these penalties both based on expert annotation and in an unsupervised fashion that encourages diverse models with qualitatively different decision boundaries for the same classification problem. On multiple datasets, we show our approach generates faithful explanations and models that generalize much better when conditions differ between training and test.
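
As a rough illustration of the gradient penalty the abstract describes, the sketch below computes a loss made of a cross-entropy term plus a penalty on masked input gradients of the summed log probabilities. It assumes a linear softmax classifier so that the input gradient can be written analytically in NumPy. The function names, the `lam2` weight-decay term, and the default values are illustrative assumptions, not the authors' code. `A` is a binary mask marking input components annotated as irrelevant, and `lam1` scales the penalty.

```python
import numpy as np


def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def input_gradients(W, b, X):
    """Gradient of sum_k log p_k(x) with respect to x for logits z = X @ W + b.

    For this linear model: d/dx sum_k log p_k = sum_k W[:, k] - K * sum_j p_j W[:, j].
    W has shape (D, K), X has shape (N, D); the result has shape (N, D).
    """
    K = W.shape[1]
    p = np.exp(log_softmax(X @ W + b))  # class probabilities, shape (N, K)
    return W.sum(axis=1)[None, :] - K * (p @ W.T)


def explanation_penalized_loss(W, b, X, y_onehot, A, lam1=1000.0, lam2=1e-4):
    """Cross-entropy + lam1 * masked input-gradient penalty + lam2 * weight decay.

    A is a binary (N, D) mask: 1 where an input component is annotated as
    irrelevant, 0 elsewhere. lam2 and both defaults are assumptions.
    """
    log_p = log_softmax(X @ W + b)
    right_answers = -(y_onehot * log_p).sum()
    right_reasons = lam1 * ((A * input_gradients(W, b, X)) ** 2).sum()
    weight_decay = lam2 * (W ** 2).sum()
    return right_answers + right_reasons + weight_decay
```

With `A` set to all zeros the penalty vanishes and the loss reduces to cross-entropy plus weight decay. For a deep network, an automatic-differentiation framework would replace the hand-derived `input_gradients`.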

Paper Structure

This paper contains 15 sections, 2 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Input gradients lie normal to the model's decision boundary. Examples above are for simple, 2D, two- and three-class datasets, with input gradients taken with respect to a two-hidden-layer multilayer perceptron with ReLU activations. Probability input gradients are sharpest near decision boundaries, while log-probability input gradients are more consistent within decision regions. The sum of log-probability gradients contains information about the full model.
  • Figure 2: Gradient vs. LIME explanations of nine perceptron predictions on the Toy Color dataset. For gradients, we plot dots above pixels identified by $M_{0.67}\left[f_X\right]$ (the top 33% largest-magnitude input gradients), and for LIME, we select the top 6 features (up to 3 can reside in the same RGB pixel). Both methods suggest that the model learns the corner rule.
  • Figure 3: Implicit rule transitions as we increase $\lambda_1$ and the number of nonzero rows of $A$. Pairs of points represent the fraction of large-magnitude ($c=0.67$) gradient components in the corners and top-middle for 1000 test examples, which almost always add to 1 (indicating the model is most sensitive to these elements alone, even during transitions). Note there is a wide regime where the model learns a hybrid of both rules.
  • Figure 4: Rule discovery using the find-another-explanation method with a 0.67 cutoff and $\lambda_1=10^3$ for $\theta_1$ and $\lambda_1=10^6$ for $\theta_2$. Note how the first two iterations produce explanations corresponding to the two rules in the dataset, while the third produces very noisy explanations (with low accuracies).
  • Figure 5: Words identified by LIME vs. gradients on an example from the atheism vs. Christianity subset of 20 Newsgroups. More examples are available at https://github.com/dtak/rrr. Words are blue if they support soc.religion.christian and orange if they support alt.atheism, with opacity equal to the ratio of the magnitude of the word's weight to the largest magnitude weight. LIME generates sparser explanations but the weights and signs of terms identified by both methods match closely. Note that both methods reveal some aspects of the model that are intuitive ("church" and "service" are associated with Christianity), some aspects that are not ("13" is associated with Christianity, "edu" with atheism), and some that are debatable ("freedom" is associated with atheism, "friends" with Christianity).
  • ...and 8 more figures
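
The mask $M_c$ and the find-another-explanation loop referenced in the captions above can be sketched as follows. This is a minimal sketch under stated assumptions. It reads $M_c$ as keeping gradient components whose magnitude is at least a fraction `c` of the per-example maximum (the Figure 2 caption could also be read as a quantile cutoff). `train` and `explain` are hypothetical callables supplied by the caller, not functions from the authors' repository.

```python
import numpy as np


def magnitude_mask(grads, c=0.67):
    """Binary mask of components with |gradient| >= c * (per-example max |gradient|).

    grads has shape (N, D). Rows that are entirely zero produce an all-zero mask.
    """
    mag = np.abs(grads)
    peak = mag.max(axis=1, keepdims=True)
    peak = np.where(peak == 0, 1.0, peak)  # avoid division by zero
    return (mag / peak >= c).astype(float)


def find_another_explanation(train, explain, X, y, n_models=3, c=0.67):
    """Train a sequence of models, each penalized on what earlier models relied on.

    train(X, y, A) -> model     (hypothetical: fits a model with gradient mask A)
    explain(model, X) -> (N, D) (hypothetical: input gradients of that model)
    """
    A = np.zeros_like(X, dtype=float)  # the first model is trained without constraints
    models = []
    for _ in range(n_models):
        model = train(X, y, A)
        models.append(model)
        # Accumulate the union of all previously used input components.
        A = np.maximum(A, magnitude_mask(explain(model, X), c))
    return models, A
```

In this sketch the penalty strength would be chosen inside `train`; the Figure 4 caption suggests it may need to grow across iterations. The loop would typically stop once accuracy drops or explanations become noisy.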