Explain Yourself, Briefly! Self-Explaining Neural Networks with Concise Sufficient Reasons

Shahaf Bassan, Ron Eliav, Shlomit Gur

TL;DR

The paper tackles the problem of extracting minimal, faithful explanations for neural predictions by addressing the computational and out-of-distribution (OOD) limitations of post-hoc minimal sufficient reasons. It introduces sufficient subset training (SST), a self-explaining framework in which models jointly produce predictions and concise sufficient reasons via a dual-output architecture and three loss terms (prediction, faithfulness, and cardinality). SST supports three masking forms (baseline, probabilistic, and robust), each realizing the corresponding notion of sufficiency, and uses a dual propagation to enforce sufficiency. Across image and language tasks, SST yields explanations that are smaller and more faithful than those of post-hoc methods, while maintaining comparable predictive performance and achieving substantial efficiency gains, thereby enabling scalable, human-aligned explanations integrated into the training process.
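
The three loss terms and the dual propagation can be made concrete with a small sketch. The summary above names the terms but not their exact form, so everything specific here is an assumption for illustration: cross-entropy for the prediction and faithfulness terms, an L1 penalty for cardinality, a 0.5 threshold on the mask, a zero baseline, and the hypothetical names `sst_loss`, `toy_predict`, `toy_explain`, `faith_weight`, and `card_weight`. It only evaluates the loss for one example and is not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(probs[target])

def apply_baseline_mask(x, mask, baseline=0.0, threshold=0.5):
    """Keep features whose mask score reaches the threshold (assumed 0.5);
    replace every other feature with the baseline value."""
    return [xi if mi >= threshold else baseline for xi, mi in zip(x, mask)]

def sst_loss(predict, explain, x, y, faith_weight=1.0, card_weight=0.1):
    """Assumed form: prediction + faith_weight * faithfulness + card_weight * cardinality."""
    # First propagation: class probabilities and mask scores from the full input.
    probs = predict(x)
    mask = explain(x)
    predicted_class = max(range(len(probs)), key=probs.__getitem__)
    # Second propagation: class probabilities from the masked input.
    masked_probs = predict(apply_baseline_mask(x, mask))
    pred_loss = cross_entropy(probs, y)                        # fit the label
    faith_loss = cross_entropy(masked_probs, predicted_class)  # subset keeps the prediction
    card_loss = sum(abs(m) for m in mask)                      # prefer small subsets
    return pred_loss + faith_weight * faith_loss + card_weight * card_loss

# Hypothetical toy model: a fixed linear classifier and a fixed mask output.
W = [[2.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

def toy_predict(x):
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

def toy_explain(x):
    return [1.0, 0.0, 0.0]  # selects only the first feature

loss = sst_loss(toy_predict, toy_explain, [1.0, 0.2, 0.1], 0)
```

Hard thresholding is not differentiable, so an actual training loop would need a differentiable treatment of the mask; the sketch sidesteps this by only computing the loss value.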

Abstract

*Minimal sufficient reasons* represent a prevalent form of explanation: the smallest subset of input features which, when held constant at their corresponding values, ensures that the prediction remains unchanged. Previous *post-hoc* methods attempt to obtain such explanations but face two main limitations: (1) Obtaining these subsets poses a computational challenge, leading most scalable methods to converge towards suboptimal, less meaningful subsets; (2) These methods heavily rely on sampling out-of-distribution input assignments, potentially resulting in counterintuitive behaviors. To tackle these limitations, we propose in this work a self-supervised training approach, which we term *sufficient subset training* (SST). Using SST, we train models to generate concise sufficient reasons for their predictions as an integral part of their output. Our results indicate that our framework produces succinct and faithful subsets substantially more efficiently than competing post-hoc methods, while maintaining comparable predictive performance.
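
To illustrate the definition and the computational challenge in limitation (1), the sketch below searches for a cardinally minimal subset under baseline sufficiency (features outside the subset are replaced by a fixed baseline). The classifier `classify`, its weights, the zero baseline, and the helper names are hypothetical, and the exhaustive search is a plain enumeration chosen for clarity, not one of the post-hoc methods the paper compares against.

```python
from itertools import combinations

def classify(x):
    """Hypothetical toy classifier: class 1 if the weighted sum is positive."""
    weights = [3.0, -1.0, 0.5, 0.2]
    return int(sum(w * v for w, v in zip(weights, x)) > 0)

def is_baseline_sufficient(model, x, subset, baseline=0.0):
    """True if fixing the features in `subset` and setting all others to
    the baseline leaves the model's prediction unchanged."""
    masked = [v if i in subset else baseline for i, v in enumerate(x)]
    return model(masked) == model(x)

def minimal_baseline_sufficient_reason(model, x, baseline=0.0):
    """Try subsets in order of increasing size and return the first
    sufficient one. Worst case visits all 2^n subsets."""
    n = len(x)
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            if is_baseline_sufficient(model, x, set(subset), baseline):
                return set(subset)
    return set(range(n))  # not reached: the full feature set is always sufficient

x = [1.0, 1.0, 1.0, 1.0]
reason = minimal_baseline_sufficient_reason(classify, x)
```

Each candidate subset costs one forward pass, so the exponential number of candidates is what makes exact minimality impractical for inputs with many features.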

Paper Structure

This paper contains 40 sections, 15 theorems, 21 equations, 13 figures, 7 tables.

Key Result

Theorem 1

Given a neural network classifier $f$ with ReLU activations and $\textbf{x}\in\mathbb{R}^n$, obtaining a cardinally minimal sufficient reason for $\langle f,\textbf{x}\rangle$ is

Figures (13)

  • Figure 1: An example of a sufficient reason generated by a model trained with sufficient subset training (SST) on the IMAGENET dataset, compared to those generated by post-hoc methods on standard-trained models. While explanations from Anchors and GS are larger, those from SIS are less faithful and lack subset sufficiency (details in the experiments section). SST generates explanations that are both concise and faithful, while performing this task with significantly improved efficiency. Additional examples appear in the supplementary results appendix.
  • Figure 2: An illustration of the dual propagation incorporated during sufficient subset training
  • Figure 3: Examples of sufficient reasons produced by SST compared to the ones generated by post-hoc approaches for MNIST, CIFAR-10, and IMAGENET. Additional examples appear in the supplementary results appendix.
  • Figure 4: The faithfulness-cardinality tradeoff in baseline-masking SST models for MNIST with varying cardinality loss coefficients, $\xi$, shows that higher $\xi$ increases mask size $\overline{S}$ but reduces faithfulness, and vice versa.
  • Figure 5: Explanations generated by SST using baseline vs. probabilistic masking. When each token in the complement $\overline{S}$ is replaced with the MASK token, the prediction stays negative. In the probabilistic setting, the prediction remains negative when values from $\overline{S}$ are randomly sampled. Further examples can be found in the supplementary results appendix.
  • ...and 8 more figures

Theorems & Definitions (20)

  • Definition 1
  • Definition 2
  • Definition 3
  • Theorem 1
  • Theorem 2
  • Theorem 1
  • Lemma 1
  • Definition 1
  • Lemma 2
  • Definition 2
  • ...and 10 more