What You See is What You Classify: Black Box Attributions

Steven Stalder; Nathanaël Perraudin; Radhakrishna Achanta; Fernando Perez-Cruz; Michele Volpi

TL;DR

The paper addresses the problem of explaining deep image classifiers by introducing an Explainer network that learns dense, class-specific attribution masks for a frozen black-box classifier (the Explanandum). In a single forward pass, the Explainer produces a set of per-class masks $\mathbf{S}$ and an aggregated target mask $\mathbf{m}$ that localize the classifier-relevant regions. It is trained with a four-term loss $\mathcal{L}_E = \mathcal{L}_c + \lambda_e \mathcal{L}_e + \lambda_a \mathcal{L}_a + \lambda_{tv} \mathcal{L}_{tv}$ that combines a classification loss on the masked input, a negative entropy loss on the input masked with the complement $\tilde{\mathbf{m}}$, an area loss, and a smoothness loss on the masks. Empirically, the method yields sharp, boundary-accurate masks that are more class-specific than those of baselines such as Grad-CAM, RISE, EP, and RTIS, and it achieves segmentation-like accuracy on VOC-2007 and COCO-2014 without requiring pixel-wise annotations. These contributions offer a practical and efficient way to produce faithful explanations of black-box classifiers, and they suggest avenues for improving robustness and generalization across architectures and datasets.
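To make the structure of the four-term objective concrete, the following is a minimal NumPy sketch of how such a weighted sum could be assembled. It is not the authors' implementation: the binary cross-entropy form of $\mathcal{L}_c$, the sum of per-class Bernoulli entropies used for $\mathcal{L}_e$, the plain mean-activation form of $\mathcal{L}_a$, the anisotropic total variation used for $\mathcal{L}_{tv}$, and all function names and default weights are assumptions chosen for illustration. The classifier outputs are passed in as precomputed probability vectors, since the Explanandum itself is treated as a black box.

```python
import numpy as np

def total_variation(mask):
    """Mean absolute difference between neighbouring pixels of a 2-D mask."""
    dh = np.abs(np.diff(mask, axis=0)).mean()
    dw = np.abs(np.diff(mask, axis=1)).mean()
    return float(dh + dw)

def entropy(p, eps=1e-8):
    """Sum of per-class Bernoulli entropies (assumed multi-label setting)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-(p * np.log(p) + (1.0 - p) * np.log(1.0 - p)).sum())

def binary_cross_entropy(p, y, eps=1e-8):
    """Mean binary cross-entropy between probabilities p and 0/1 labels y."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean())

def explainer_loss(probs_masked, probs_complement, labels, m, n,
                   lam_e=1.0, lam_a=1.0, lam_tv=1.0):
    """Weighted sum L_c + lam_e*L_e + lam_a*L_a + lam_tv*L_tv (illustrative).

    probs_masked:     classifier probabilities on the image times m
    probs_complement: classifier probabilities on the image times (1 - m)
    labels:           multi-hot ground-truth label vector
    m, n:             target and non-target masks, shape (H, W), values in [0, 1]
    """
    l_c = binary_cross_entropy(probs_masked, labels)  # classify the masked image
    l_e = -entropy(probs_complement)                  # negative entropy on the complement
    l_a = float(m.mean() + n.mean())                  # simplified area penalty (assumption)
    l_tv = total_variation(m) + total_variation(n)    # mask smoothness
    total = l_c + lam_e * l_e + lam_a * l_a + lam_tv * l_tv
    return total, {"L_c": l_c, "L_e": l_e, "L_a": l_a, "L_tv": l_tv}
```

In this sketch, minimizing the negative entropy term rewards a classifier output that is maximally uncertain once the target region has been masked out, while the area and smoothness terms discourage masks that are large or noisy.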

Abstract

An important step towards explaining deep image classifiers lies in the identification of image regions that contribute to individual class scores in the model's output. However, doing this accurately is a difficult task due to the black-box nature of such networks. Most existing approaches find such attributions either using activations and gradients or by repeatedly perturbing the input. We instead address this challenge by training a second deep network, the Explainer, to predict attributions for a pre-trained black-box classifier, the Explanandum. These attributions are provided in the form of masks that only show the classifier-relevant parts of an image, masking out the rest. Our approach produces sharper and more boundary-precise masks when compared to the saliency maps generated by other methods. Moreover, unlike most existing approaches, ours is capable of directly generating very distinct class-specific masks in a single forward pass. This makes the proposed method very efficient during inference. We show that our attributions are superior to established methods both visually and quantitatively with respect to the PASCAL VOC-2007 and Microsoft COCO-2014 datasets.

Paper Structure (22 sections, 9 equations, 8 figures, 5 tables)

Introduction
Related Work
Activation and gradient-based methods.
Local perturbations and local models.
Training the Explainer
Training loss
Classification loss: $\mathcal{L}_{c}(\mathbf{x}, \mathcal{Y}, \mathbf{m})$.
Negative entropy loss: $\mathcal{L}_{e}(\mathbf{x}, \tilde{\mathbf{m}})$.
Area loss: $\mathcal{L}_{a}(\mathbf{m}, \mathbf{n}, \mathbf{S})$.
Smoothness Loss: $\mathcal{L}_{tv}(\mathbf{m}, \mathbf{n})$.
Experiments
Visual Comparison
Class-specific masking
Segmentation accuracy
Limitations
...and 7 more sections

Figures (8)

Figure 1: Visual comparison of per-class attributions provided for VOC-2007 by our Explainer, alongside Grad-CAM (GCam) [selvaraju17iccv] and Extremal Perturbations (EP) [fong19iccv] for the VGG-16 architecture [simonyan14arxiv]. Our attributions have sharper boundaries and at the same time are more class-accurate than Grad-CAM or EP. Attributions for only five out of the twenty VOC-2007 classes are shown for convenience. The colormap ranges from low (blue) to high (red) saliency.
Figure 2: Overview of our method. Given a pre-trained Explanandum $\mathcal{F}$, whose weights are frozen, the Explainer network $\mathcal{E}$ learns to produce masks $\mathbf{s}_c$ for each class. The masks corresponding to the label(s) associated with the input image (shown in green) are merged by taking the pixel-wise maximum over masks (shown as $\uparrow$'s), to obtain a target mask $\mathbf{m}$ and its complement $\tilde{\mathbf{m}}$ (i.e., the inverted mask). All the other masks (shown in red), which do not correspond to the labels of the input image but might still score positively for the given image, are merged separately to obtain the non-target mask $\mathbf{n}$, which is also used in the loss term. The images obtained by multiplying the target mask and its complement with the input image (shown by $\times$'s) are fed to the given pre-trained Explanandum separately, generating two outputs on which we compute losses. The set of per-class masks $\mathbf{S}$ and the aggregated target mask $\mathbf{m}$ serve as the attributions provided by our Explainer.
Figure 3: Visual comparison of our attributions for a VGG-16 network fine-tuned on images from the VOC-2007 dataset and frozen. Our attributions are much more effective at retaining object class regions and discarding the rest. Examples 13 and 14 show cases where our Explainer is inaccurate.
Figure 4: Top-5 class-wise attributions for the VGG-16 classifier, for 8 random images from the VOC-2007 test set. Class-wise masks are sorted according to their average activation on the image plane. TAC denotes the top activating class; see the legend in Tab. \ref{tab:legend}. CLS shows the respective class probabilities by the classifier on the original (unmasked) images, multiplied by 100. AMA shows the average mask activations for the respective class, also multiplied by 100.
Figure 5: Top-5 class-wise attributions for the ResNet-50 classifier, for 8 random images from the VOC-2007 test set. Class-wise masks are sorted according to their average activation on the image plane. TAC denotes the top activating class; see the legend in Tab. \ref{tab:legend}. CLS shows the respective class probabilities by the classifier on the original (unmasked) images, multiplied by 100. AMA shows the average mask activations for the respective class, also multiplied by 100.
...and 3 more figures
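
The mask aggregation described in the Figure 2 caption, and the ranking by average mask activation used in Figures 4 and 5, can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: the function names, the `(C, H, W)` array layout, and the fallback to an all-zero mask when a group of classes is empty are assumptions.

```python
import numpy as np

def aggregate_masks(S, labels):
    """Merge per-class masks into target, complement, and non-target masks.

    S:      per-class masks, shape (C, H, W), values in [0, 1]
    labels: multi-hot vector of length C marking the image's classes
    """
    labels = np.asarray(labels, dtype=bool)
    h, w = S.shape[1:]
    # Pixel-wise maximum over the masks of the labelled classes.
    m = S[labels].max(axis=0) if labels.any() else np.zeros((h, w))
    # Pixel-wise maximum over all remaining masks.
    n = S[~labels].max(axis=0) if (~labels).any() else np.zeros((h, w))
    m_tilde = 1.0 - m  # complement (inverted) target mask
    return m, m_tilde, n

def apply_mask(image, mask):
    """Multiply an (H, W, 3) image by an (H, W) mask, broadcast over channels."""
    return image * mask[..., None]

def rank_classes_by_activation(S, k=5):
    """Return the indices and average mask activations of the top-k classes."""
    ama = S.reshape(S.shape[0], -1).mean(axis=1)
    order = np.argsort(-ama)[:k]
    return order, ama[order]
```

The two images `apply_mask(x, m)` and `apply_mask(x, m_tilde)` would then be passed separately through the frozen classifier, which is what yields the two outputs on which the losses are computed.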
