Explain Yourself, Briefly! Self-Explaining Neural Networks with Concise Sufficient Reasons
Shahaf Bassan, Ron Eliav, Shlomit Gur
TL;DR
The paper addresses the problem of extracting minimal, faithful explanations for neural network predictions, targeting two weaknesses of post-hoc minimal sufficient reasons: their computational cost and their reliance on out-of-distribution (OOD) inputs. It introduces sufficient subset training (SST), a self-explaining framework in which a model jointly produces a prediction and a concise sufficient reason through a dual-output architecture trained with three loss terms (prediction, faithfulness, and cardinality). SST supports three masking schemes, corresponding to baseline, probabilistic, and robust sufficiency, and enforces sufficiency through a dual propagation. Across image and language tasks, SST yields explanations that are smaller and more faithful than those of post-hoc methods, with comparable predictive performance and substantial efficiency gains, thereby enabling scalable, human-aligned explanations that are integrated into the training process.
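To make the three loss terms concrete, here is a minimal sketch of how they might be combined under baseline masking. It is an illustration under stated assumptions, not the authors' implementation: the function names, the weights `faith_weight` and `card_weight`, the zero baseline, the use of cross-entropy for both the prediction and faithfulness terms, and the L1 form of the cardinality term are all hypothetical choices. The "dual propagation" is read here as a second forward pass on the masked input.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Negative log-probability assigned to the target class.
    return -math.log(softmax(logits)[target])

def apply_baseline_mask(x, mask, baseline=0.0):
    # Keep features where mask is 1; replace the rest with a fixed baseline value.
    return [m * value + (1.0 - m) * baseline for value, m in zip(x, mask)]

def sst_loss(model, x, label, faith_weight=1.0, card_weight=0.1, baseline=0.0):
    # Assumption: `model` returns (class logits, per-feature mask in [0, 1]).
    logits, mask = model(x)                      # first pass: prediction + explanation
    predicted = max(range(len(logits)), key=logits.__getitem__)
    masked_logits, _ = model(apply_baseline_mask(x, mask, baseline))  # second pass
    pred_loss = cross_entropy(logits, label)             # fit the true label
    faith_loss = cross_entropy(masked_logits, predicted) # masked input keeps the prediction
    card_loss = sum(abs(m) for m in mask)                # favor small explanations
    return pred_loss + faith_weight * faith_loss + card_weight * card_loss

def toy_model(x):
    # Hypothetical stand-in for a dual-output network with a fixed mask.
    logits = [x[0] + x[1], -x[0] - x[1]]
    mask = [1.0, 0.0]
    return logits, mask

loss = sst_loss(toy_model, [2.0, 1.0], label=0)
```

In a real setting the mask would be produced by a learned explanation head, and the trade-off between explanation size and faithfulness would be controlled by the two weights.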
Abstract
*Minimal sufficient reasons* represent a prevalent form of explanation: the smallest subset of input features which, when held constant at their corresponding values, ensure that the prediction remains unchanged. Previous *post-hoc* methods attempt to obtain such explanations but face two main limitations: (1) obtaining these subsets poses a computational challenge, leading most scalable methods to converge towards suboptimal, less meaningful subsets; (2) these methods rely heavily on sampling out-of-distribution input assignments, potentially resulting in counterintuitive behaviors. To tackle these limitations, we propose in this work a self-supervised training approach, which we term *sufficient subset training* (SST). Using SST, we train models to generate concise sufficient reasons for their predictions as an integral part of their output. Our results indicate that our framework produces succinct and faithful subsets substantially more efficiently than competing post-hoc methods, while maintaining comparable predictive performance.
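The definition of a minimal sufficient reason, and the computational challenge in limitation (1), can be illustrated with a toy brute-force search. The sketch below assumes binary features and a hypothetical Boolean classifier `f`, and checks sufficiency exhaustively over all completions of the free features. It is not one of the post-hoc methods the paper compares against; it only shows why exact search scales exponentially with the number of features.

```python
from itertools import combinations, product

def is_sufficient(f, x, subset):
    # `subset` is sufficient if fixing those features to their values in x
    # keeps f's output unchanged for every assignment of the other features.
    free = [i for i in range(len(x)) if i not in subset]
    target = f(x)
    for values in product([0, 1], repeat=len(free)):
        z = list(x)
        for i, v in zip(free, values):
            z[i] = v
        if f(z) != target:
            return False
    return True

def minimal_sufficient_reason(f, x):
    # Try subsets in order of increasing size; the first sufficient one is
    # a minimum-cardinality sufficient reason. The full feature set is always
    # sufficient, so the search always returns.
    n = len(x)
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            if is_sufficient(f, x, set(subset)):
                return set(subset)

def f(z):
    # Hypothetical classifier: z0 AND (z1 OR z2).
    return int(z[0] and (z[1] or z[2]))

reason_pos = minimal_sufficient_reason(f, [1, 1, 0])
reason_neg = minimal_sufficient_reason(f, [0, 1, 1])
```

For the input `[1, 1, 0]`, fixing features 0 and 1 is enough to guarantee a positive prediction, whereas for `[0, 1, 1]` fixing feature 0 alone guarantees a negative one. Both loops are exponential in the number of features, which is why exact search does not scale to image or text inputs.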
