Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations
Andrew Slavin Ross, Michael C. Hughes, Finale Doshi-Velez
TL;DR
The paper addresses robustness and trust in neural networks when training and test distributions differ, using input-gradient explanations. It introduces a differentiable loss term that penalizes input gradients in regions a user has marked as irrelevant, so that the model's explanations agree with domain knowledge. It also introduces a find-another-explanation method that, when annotations are unavailable, uncovers multiple similarly accurate but qualitatively different decision rules. Experiments on toy and real datasets indicate that gradient explanations are faithful and comparable to those produced by LIME, and that explanation regularization improves generalization and helps reveal confounds; the iterative find-another-explanation procedure yields diverse models for human inspection. Overall, the work presents explanation-driven regularization as a scalable route to models that generalize better for the right reasons, with practical uses in model auditing and human-in-the-loop refinement.
Abstract
Neural networks are among the most accurate supervised learning methods in use today, but their opacity makes them difficult to trust in critical applications, especially when conditions in training differ from those in test. Recent work on explanations for black-box models has produced tools (e.g. LIME) to show the implicit rules behind predictions, which can help us identify when models are right for the wrong reasons. However, these methods do not scale to explaining entire datasets and cannot correct the problems they reveal. We introduce a method for efficiently explaining and regularizing differentiable models by examining and selectively penalizing their input gradients, which provide a normal to the decision boundary. We apply these penalties both based on expert annotation and in an unsupervised fashion that encourages diverse models with qualitatively different decision boundaries for the same classification problem. On multiple datasets, we show our approach generates faithful explanations and models that generalize much better when conditions differ between training and test.
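The idea of selectively penalizing input gradients can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation. It assumes a linear-softmax classifier (chosen only because its input gradient has a closed form), a binary mask `A` marking features an annotator considers irrelevant, and a squared penalty on the masked gradient of the summed log-probabilities. The names `input_gradients`, `explanation_loss`, and `lam` are hypothetical. For a real neural network the gradient would normally come from an automatic differentiation library rather than a hand-derived formula.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def input_gradients(X, W, b):
    # Gradient of sum_k log p_k(x) with respect to x for the assumed
    # linear-softmax model p(x) = softmax(x @ W + b).
    # d/d(logit_j) of sum_k log p_k equals 1 - K * p_j; chain through W.
    P = np.exp(log_softmax(X @ W + b))   # (N, K) class probabilities
    K = W.shape[1]
    return (1.0 - K * P) @ W.T           # (N, D), one gradient per example

def explanation_loss(X, Y, A, W, b, lam=1.0):
    # "Right answers" term: cross-entropy against one-hot labels Y.
    right_answers = -np.sum(Y * log_softmax(X @ W + b))
    # "Right reasons" term: squared input gradient, kept only where the
    # mask A is 1 (features assumed to be irrelevant).
    right_reasons = lam * np.sum((A * input_gradients(X, W, b)) ** 2)
    return right_answers + right_reasons, right_reasons
```

Under these assumptions, a model that ignores the masked features incurs no penalty, while a model whose predictions vary with them pays a cost that grows with `lam`. Minimizing the combined loss therefore trades predictive fit against agreement with the annotation.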
