Activation Matching for Explanation Generation

Pirzada Suhail, Aditya Anand, Amit Sethi

TL;DR

The paper addresses the need for faithful, minimal explanations of decisions made by frozen image classifiers. It introduces an activation-matching framework that learns a lightweight autoencoder to output a binary mask $m$, producing an explanation $e = m \odot x$ that preserves both the classifier’s prediction and its intermediate activations via a multi-term objective combining $\mathcal{L}_{\text{act}}$, $\mathcal{L}_{\text{KL}}$, and $\mathcal{L}_{\text{CE}}$, along with mask priors $\mathcal{L}_{\text{area}}$, $\mathcal{L}_{\text{bin}}$, and $\mathcal{L}_{\text{tv}}$, plus an abductive robustness term $\mathcal{L}_{\text{rob}}$. The approach yields highly sparse, binary explanations that focus on essential evidence while discarding irrelevancies, demonstrated on ImageNet with ResNet-18. Key contributions include the explicit multi-layer activation alignment, probabilistic and predictive fidelity losses, principled sparsity and crispness priors, and a robustness constraint that preserves explanations under perturbations. This framework enables trustworthy, interpretable debugging and model understanding by exposing the minimal decision-supporting regions of input data.
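Read together, the loss terms named above suggest a single training objective. One plausible way to write it, assuming a plain weighted sum, is sketched below. The weights $\lambda_{\cdot}$ are hypothetical symbols introduced here for illustration, and the paper's exact formulation may differ.

$$
\mathcal{L}_{\text{total}} = \lambda_{\text{act}}\,\mathcal{L}_{\text{act}} + \lambda_{\text{KL}}\,\mathcal{L}_{\text{KL}} + \lambda_{\text{CE}}\,\mathcal{L}_{\text{CE}} + \lambda_{\text{area}}\,\mathcal{L}_{\text{area}} + \lambda_{\text{bin}}\,\mathcal{L}_{\text{bin}} + \lambda_{\text{tv}}\,\mathcal{L}_{\text{tv}} + \lambda_{\text{rob}}\,\mathcal{L}_{\text{rob}}
$$

Under this reading, the first three terms tie the explanation $e = m \odot x$ to the classifier's behavior on $x$, the next three shape the mask $m$ itself, and the last term enforces robustness.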

Abstract

In this paper we introduce an activation-matching-based approach to generate minimal, faithful explanations for the decision-making of a pretrained classifier on any given image. Given an input image $x$ and a frozen model $f$, we train a lightweight autoencoder to output a binary mask $m$ such that the explanation $e = m \odot x$ preserves both the model's prediction and the intermediate activations of $x$. Our objective combines: (i) multi-layer activation matching with KL divergence to align distributions and cross-entropy to retain the top-1 label for both the image and the explanation; (ii) mask priors: L1 area for minimality, a binarization penalty for crisp 0/1 masks, and total variation for compactness; and (iii) abductive constraints for faithfulness and necessity. Together, these objectives yield small, human-interpretable masks that retain classifier behavior while discarding irrelevant input regions, providing practical and faithful minimalist explanations for the decision-making of the underlying model.
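The three mask priors in (ii) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names are hypothetical, and the specific forms chosen here (mean absolute value for area, $m(1-m)$ for binarization, anisotropic absolute differences for total variation) are common choices assumed for illustration.

```python
import numpy as np

def apply_mask(x, m):
    """Explanation e = m * x, with an (H, W) mask broadcast over a (C, H, W) image."""
    return m[None, :, :] * x

def area_loss(m):
    """L1 area prior: mean absolute mask value (smaller means a sparser mask)."""
    return np.abs(m).mean()

def binarization_loss(m):
    """Assumed form m * (1 - m): zero at exactly 0 or 1, largest at 0.5."""
    return (m * (1.0 - m)).mean()

def total_variation_loss(m):
    """Assumed anisotropic TV: summed absolute differences between neighbors."""
    dh = np.abs(m[1:, :] - m[:-1, :]).sum()  # vertical neighbors
    dw = np.abs(m[:, 1:] - m[:, :-1]).sum()  # horizontal neighbors
    return (dh + dw) / m.size

# Two binary masks with the same area: one compact block, one scattered pixels.
block = np.zeros((8, 8))
block[2:6, 2:6] = 1.0
scattered = np.zeros((8, 8))
scattered[::2, ::2] = 1.0

# A toy 3-channel image of ones, so the explanation simply copies the mask.
x = np.ones((3, 8, 8))
e = apply_mask(x, block)

print("area:", area_loss(block), area_loss(scattered))
print("tv:  ", total_variation_loss(block), total_variation_loss(scattered))
print("bin: ", binarization_loss(block), binarization_loss(np.full((8, 8), 0.5)))
```

Both masks keep the same number of pixels, so the area prior cannot tell them apart. The total-variation term is what favors the compact block over the scattered pattern, and the binarization term only penalizes the soft (0.5-valued) mask.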

Paper Structure

This paper contains 14 sections, 9 equations, 4 figures.

Figures (4)

  • Figure 1: Original Image, 0/1 Mask, and Explanation
  • Figure 2: Explanations for sample images of otters.
  • Figure 3: Explanations for misclassified images.
  • Figure 4: Effect of varying loss weights on generated explanations. (1) With heavily weighted area and total variation losses, the explanation becomes extremely small and localized. (2) Example of shortcut learning: the model highlights not only the dog but also the leash, reflecting dataset biases where dogs frequently appear with leashes. (3) With relaxed constraints, a larger portion of the dog and some background regions are included. (4) Further relaxation of the area loss highlights the entire dog, demonstrating how the approach can be extended toward instance-level segmentation.