What You See is What You Classify: Black Box Attributions
Steven Stalder, Nathanaël Perraudin, Radhakrishna Achanta, Fernando Perez-Cruz, Michele Volpi
TL;DR
The paper tackles the challenge of explaining deep image classifiers by introducing an Explainer, a second network that learns dense, class-specific attribution masks for a frozen black-box classifier (the Explanandum). In a single forward pass, the Explainer generates a per-class mask set $\mathbf{S}$ and an aggregated target mask $\mathbf{m}$ that localize classifier-relevant regions. It is trained with a four-term loss $L_E = L_c + \lambda_e L_e + \lambda_a L_a + \lambda_{tv} L_{tv}$ that combines accurate classification on masked inputs ($L_c$), an entropy term on the masked-out background ($L_e$), area constraints ($L_a$), and mask smoothness ($L_{tv}$). Empirically, the method yields sharp, boundary-accurate masks that are more class-specific than baselines like Grad-CAM, RISE, EP, and RTIS, and achieves segmentation-like accuracy on VOC-2007 and COCO-2014 without requiring pixel-wise annotations. These contributions offer a practical, efficient approach for producing faithful explanations of black-box classifiers and suggest avenues for retraining explanations to improve robustness and generalization across architectures and datasets.
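To make the weighted sum in $L_E$ concrete, here is a minimal Python sketch. The summary names the four terms but does not define them, so everything below is an assumption for illustration: `mean_area` and `total_variation` are common stand-ins for an area penalty and a smoothness penalty rather than the paper's formulations, the function names are hypothetical, and the default weights are placeholders. The classification and entropy terms depend on the Explanandum's outputs, so they are passed in as precomputed numbers.

```python
def mean_area(mask):
    """Average mask value; an assumed stand-in for the area term L_a."""
    h, w = len(mask), len(mask[0])
    return sum(sum(row) for row in mask) / (h * w)


def total_variation(mask):
    """Mean absolute difference between vertically and horizontally
    adjacent mask values; an assumed stand-in for the smoothness term L_tv."""
    h, w = len(mask), len(mask[0])
    tv = 0.0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                tv += abs(mask[i + 1][j] - mask[i][j])
            if j + 1 < w:
                tv += abs(mask[i][j + 1] - mask[i][j])
    return tv / (h * w)


def explainer_loss(l_c, l_e, l_a, l_tv,
                   lambda_e=1.0, lambda_a=1.0, lambda_tv=1.0):
    """Weighted sum L_E = L_c + lambda_e*L_e + lambda_a*L_a + lambda_tv*L_tv.
    The default weights are placeholders, not values from the paper."""
    return l_c + lambda_e * l_e + lambda_a * l_a + lambda_tv * l_tv


# Toy 3x3 mask that keeps the lower-right 2x2 block.
mask = [[0.0, 0.0, 0.0],
        [0.0, 1.0, 1.0],
        [0.0, 1.0, 1.0]]

# l_c and l_e are made-up scalars here; in practice they would come from
# running the Explanandum on the masked image and on its complement.
loss = explainer_loss(l_c=0.7, l_e=-0.3,
                      l_a=mean_area(mask), l_tv=total_variation(mask),
                      lambda_e=0.5, lambda_a=0.1, lambda_tv=0.01)
print(loss)
```

In this sketch a larger `lambda_a` pushes toward smaller masks and a larger `lambda_tv` pushes toward smoother ones. How the paper balances these weights is not stated in this summary.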
Abstract
An important step towards explaining deep image classifiers lies in the identification of image regions that contribute to individual class scores in the model's output. However, doing this accurately is a difficult task due to the black-box nature of such networks. Most existing approaches find such attributions either using activations and gradients or by repeatedly perturbing the input. We instead address this challenge by training a second deep network, the Explainer, to predict attributions for a pre-trained black-box classifier, the Explanandum. These attributions are provided in the form of masks that only show the classifier-relevant parts of an image, masking out the rest. Our approach produces sharper and more boundary-precise masks when compared to the saliency maps generated by other methods. Moreover, unlike most existing approaches, ours is capable of directly generating very distinct class-specific masks in a single forward pass. This makes the proposed method very efficient during inference. We show that our attributions are superior to established methods both visually and quantitatively with respect to the PASCAL VOC-2007 and Microsoft COCO-2014 datasets.
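As a small illustration of what "masks that only show the classifier-relevant parts of an image" can mean in practice, the sketch below applies a mask with values in [0, 1] to a toy grayscale image by elementwise multiplication and also computes the complementary, masked-out part. Elementwise multiplication is an assumption made for illustration, since the abstract does not specify the masking operation, and `apply_mask` is a hypothetical helper name.

```python
def apply_mask(image, mask):
    """Return (kept, removed): the image weighted by the mask and by its
    complement. Assumes image and mask are equally sized 2D lists and that
    mask values lie in [0, 1]."""
    kept = [[p * m for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]
    removed = [[p * (1.0 - m) for p, m in zip(img_row, mask_row)]
               for img_row, mask_row in zip(image, mask)]
    return kept, removed


# Toy 2x2 grayscale image and a binary mask keeping the diagonal pixels.
image = [[0.2, 0.8],
         [0.5, 1.0]]
mask = [[1.0, 0.0],
        [0.0, 1.0]]

kept, removed = apply_mask(image, mask)
print(kept)     # pixels the mask keeps visible
print(removed)  # pixels the mask hides
```

With a binary mask every pixel ends up in exactly one of the two outputs. With a soft mask the two outputs sum back to the original image.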
