Activation Matching for Explanation Generation
Pirzada Suhail, Aditya Anand, Amit Sethi
TL;DR
The paper addresses the need for faithful, minimal explanations of decisions made by frozen image classifiers. It introduces an activation-matching framework that learns a lightweight autoencoder to output a binary mask $m$, producing an explanation $e = m \odot x$ that preserves both the classifier’s prediction and its intermediate activations via a multi-term objective combining $\mathcal{L}_{\text{act}}$, $\mathcal{L}_{\text{KL}}$, and $\mathcal{L}_{\text{CE}}$, along with mask priors $\mathcal{L}_{\text{area}}$, $\mathcal{L}_{\text{bin}}$, and $\mathcal{L}_{\text{tv}}$, plus an abductive robustness term $\mathcal{L}_{\text{rob}}$. The approach yields highly sparse, binary explanations that focus on essential evidence while discarding irrelevancies, demonstrated on ImageNet with ResNet-18. Key contributions include the explicit multi-layer activation alignment, probabilistic and predictive fidelity losses, principled sparsity and crispness priors, and a robustness constraint that preserves explanations under perturbations. This framework enables trustworthy, interpretable debugging and model understanding by exposing the minimal decision-supporting regions of input data.
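For orientation, the multi-term objective named above can be read as a weighted sum of its components. The summary does not give the weighting scheme, so the non-negative coefficients $\lambda_{\cdot}$ below are assumed, hypothetical hyperparameters, and the form is a sketch rather than the authors' exact formulation:

$$
\mathcal{L} \;=\; \lambda_{\text{act}}\mathcal{L}_{\text{act}} + \lambda_{\text{KL}}\mathcal{L}_{\text{KL}} + \lambda_{\text{CE}}\mathcal{L}_{\text{CE}} + \lambda_{\text{area}}\mathcal{L}_{\text{area}} + \lambda_{\text{bin}}\mathcal{L}_{\text{bin}} + \lambda_{\text{tv}}\mathcal{L}_{\text{tv}} + \lambda_{\text{rob}}\mathcal{L}_{\text{rob}}
$$

Under this reading, the first three terms tie the explanation $e = m \odot x$ to the frozen classifier's behavior on $x$, the next three shape the mask $m$ itself, and the last term enforces robustness.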
Abstract
In this paper, we introduce an activation-matching-based approach to generate minimal, faithful explanations for the decision-making of a pretrained classifier on any given image. Given an input image $x$ and a frozen model $f$, we train a lightweight autoencoder to output a binary mask $m$ such that the explanation $e = m \odot x$ preserves both the model's prediction and the intermediate activations of $x$. Our objective combines: (i) multi-layer activation matching with KL divergence to align distributions and cross-entropy to retain the top-1 label for both the image and the explanation; (ii) mask priors: L1 area for minimality, a binarization penalty for crisp 0/1 masks, and total variation for compactness; and (iii) abductive constraints for faithfulness and necessity. Together, these objectives yield small, human-interpretable masks that retain classifier behavior while discarding irrelevant input regions, providing practical and faithful minimalist explanations for the decision-making of the underlying model.
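The masking step and the three mask priors can be illustrated with a small, dependency-free sketch. The function names are hypothetical, and the specific loss forms (mean mask value for the area term, $m(1-m)$ for the binarization penalty, anisotropic total variation) are common choices assumed here for illustration; the abstract does not specify the exact forms used.

```python
# Illustrative sketch only: loss forms and names are assumptions, not the paper's code.

def apply_mask(x, m):
    """Explanation e = m ⊙ x for a single-channel image given as nested lists."""
    return [[mv * xv for mv, xv in zip(mrow, xrow)] for mrow, xrow in zip(m, x)]

def area_loss(m):
    """L1 area prior: mean mask value (entries assumed to lie in [0, 1])."""
    vals = [v for row in m for v in row]
    return sum(vals) / len(vals)

def binarization_loss(m):
    """Assumed form m * (1 - m): zero only when every entry is exactly 0 or 1."""
    vals = [v for row in m for v in row]
    return sum(v * (1.0 - v) for v in vals) / len(vals)

def tv_loss(m):
    """Anisotropic total variation: absolute differences between adjacent pixels."""
    h, w = len(m), len(m[0])
    vertical = sum(abs(m[i + 1][j] - m[i][j]) for i in range(h - 1) for j in range(w))
    horizontal = sum(abs(m[i][j + 1] - m[i][j]) for i in range(h) for j in range(w - 1))
    return vertical + horizontal

# Toy 3x3 image and a crisp mask that keeps a 2x2 block in the top-right corner.
x = [[0.2, 0.8, 0.5],
     [0.9, 0.4, 0.1],
     [0.3, 0.7, 0.6]]
m = [[0.0, 1.0, 1.0],
     [0.0, 1.0, 1.0],
     [0.0, 0.0, 0.0]]

e = apply_mask(x, m)           # pixels outside the mask are zeroed
priors = {
    "area": area_loss(m),      # fraction of the image that is kept
    "bin": binarization_loss(m),
    "tv": tv_loss(m),          # grows with the length of the mask boundary
}
```

In this toy case the mask is already binary, so the binarization penalty vanishes, while the area and total-variation terms still push toward a smaller and more compact region. The fidelity terms (activation matching, KL divergence, cross-entropy) need the frozen classifier $f$ and are not shown.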
