Activation Matching for Explanation Generation

Pirzada Suhail; Aditya Anand; Amit Sethi

TL;DR

The paper addresses the need for faithful, minimal explanations of decisions made by frozen image classifiers. It introduces an activation-matching framework that learns a lightweight autoencoder to output a binary mask $m$, producing an explanation $e = m \odot x$ that preserves both the classifier’s prediction and its intermediate activations via a multi-term objective combining $\mathcal{L}_{\text{act}}$, $\mathcal{L}_{\text{KL}}$, and $\mathcal{L}_{\text{CE}}$, along with mask priors $\mathcal{L}_{\text{area}}$, $\mathcal{L}_{\text{bin}}$, and $\mathcal{L}_{\text{tv}}$, plus an abductive robustness term $\mathcal{L}_{\text{rob}}$. The approach yields highly sparse, binary explanations that focus on essential evidence while discarding irrelevancies, demonstrated on ImageNet with ResNet-18. Key contributions include the explicit multi-layer activation alignment, probabilistic and predictive fidelity losses, principled sparsity and crispness priors, and a robustness constraint that preserves explanations under perturbations. This framework enables trustworthy, interpretable debugging and model understanding by exposing the minimal decision-supporting regions of input data.
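The terms named above can be read as a single training objective. One plausible way to combine them, assuming a plain weighted sum (the weights $\lambda_{\cdot}$ are hypothetical symbols introduced here for illustration and are not taken from the paper), is:

$$
\mathcal{L} = \lambda_{\text{act}}\,\mathcal{L}_{\text{act}} + \lambda_{\text{KL}}\,\mathcal{L}_{\text{KL}} + \lambda_{\text{CE}}\,\mathcal{L}_{\text{CE}} + \lambda_{\text{area}}\,\mathcal{L}_{\text{area}} + \lambda_{\text{bin}}\,\mathcal{L}_{\text{bin}} + \lambda_{\text{tv}}\,\mathcal{L}_{\text{tv}} + \lambda_{\text{rob}}\,\mathcal{L}_{\text{rob}}
$$

Under this reading, the first three terms enforce fidelity of $e = m \odot x$ to the frozen classifier, the next three shape the mask, and the last term enforces robustness.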

Abstract

We introduce an activation-matching-based approach for generating minimal, faithful explanations of a pretrained classifier's decision on any given image. Given an input image $x$ and a frozen model $f$, we train a lightweight autoencoder to output a binary mask $m$ such that the explanation $e = m \odot x$ preserves both the model's prediction and the intermediate activations of $x$. Our objective combines: (i) multi-layer activation matching, with KL divergence to align the output distributions and cross-entropy to retain the top-1 label for both the image and the explanation; (ii) mask priors, namely an L1 area term for minimality, a binarization penalty for crisp 0/1 masks, and total variation for compactness; and (iii) abductive constraints for faithfulness and necessity. Together, these objectives yield small, human-interpretable masks that retain the classifier's behavior while discarding irrelevant input regions, giving practical, faithful, and minimal explanations of the underlying model's decisions.
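To make the mask priors and the distribution-matching term concrete, here is a minimal sketch in plain Python on a toy 3x3 mask. The specific functional forms are assumptions chosen for illustration: mean absolute value for the L1 area, mean of $m(1-m)$ for the binarization penalty, anisotropic total variation, and KL divergence between two softmax-style probability vectors. The function names are hypothetical, and the paper's exact definitions and weightings may differ.

```python
import math


def apply_mask(m, x):
    """Explanation e = m * x (elementwise) for 2-D lists of equal shape."""
    return [[mi * xi for mi, xi in zip(m_row, x_row)]
            for m_row, x_row in zip(m, x)]


def area_loss(m):
    """Assumed L1 area prior: mean absolute mask value (smaller = sparser)."""
    vals = [v for row in m for v in row]
    return sum(abs(v) for v in vals) / len(vals)


def bin_loss(m):
    """Assumed binarization penalty: mean of m * (1 - m).

    For mask values in [0, 1] this is zero only when every entry is 0 or 1.
    """
    vals = [v for row in m for v in row]
    return sum(v * (1.0 - v) for v in vals) / len(vals)


def tv_loss(m):
    """Assumed anisotropic total variation: sum of absolute differences
    between horizontally and vertically adjacent mask entries."""
    rows, cols = len(m), len(m[0])
    horiz = sum(abs(m[i][j + 1] - m[i][j])
                for i in range(rows) for j in range(cols - 1))
    vert = sum(abs(m[i + 1][j] - m[i][j])
               for i in range(rows - 1) for j in range(cols))
    return horiz + vert


def kl_div(p, q, eps=1e-12):
    """KL(p || q) between two probability vectors, e.g. the classifier's
    output distribution on x versus on the explanation e."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Toy example: a crisp mask that keeps the lower-right 2x2 block.
mask = [[0, 0, 0],
        [0, 1, 1],
        [0, 1, 1]]
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

explanation = apply_mask(mask, image)
print(explanation)       # [[0, 0, 0], [0, 5, 6], [0, 8, 9]]
print(area_loss(mask))   # 4 of 9 pixels kept
print(bin_loss(mask))    # 0.0 for a crisp 0/1 mask
print(tv_loss(mask))     # 4: one compact region has a short boundary
```

In this sketch, a soft mask such as `[[0.5, 0.5]]` is penalized by `bin_loss` (0.25), and a mask with scattered pixels gets a larger `tv_loss` than a single compact block of the same area. This is the behavior the abstract attributes to the crispness and compactness priors.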
