Opti-CAM: Optimizing saliency maps for interpretability

Hanwei Zhang, Felipe Torres, Ronan Sicre, Yannis Avrithis, Stephane Ayache

TL;DR

Opti-CAM computes a per-image saliency map as a linear combination of feature maps, with weights optimized for each image to maximize the class logit of the masked image. By combining the CAM-based and masking-based paradigms, it produces more spread-out saliency maps that cover whole objects and contextual cues without requiring extra training data. It also introduces a new metric, average gain (AG), to complement existing attribution metrics. Across CNNs and transformers on ImageNet and medical datasets, Opti-CAM outperforms other CAM-based approaches on the most relevant classification-based attribution metrics, while showing that localization performance and interpretability are not necessarily aligned. The work includes ablations, sanity checks, and implementation details that support reproducibility and potential adoption for interpretability in high-stakes domains.

Abstract

Methods based on class activation maps (CAM) provide a simple mechanism to interpret predictions of convolutional neural networks by using linear combinations of feature maps as saliency maps. By contrast, masking-based methods optimize a saliency map directly in the image space or learn it by training another network on additional data. In this work we introduce Opti-CAM, combining ideas from CAM-based and masking-based approaches. Our saliency map is a linear combination of feature maps, where weights are optimized per image such that the logit of the masked image for a given class is maximized. We also fix a fundamental flaw in two of the most common evaluation metrics of attribution methods. On several datasets, Opti-CAM largely outperforms other CAM-based approaches according to the most relevant classification metrics. We provide empirical evidence supporting that localization and classifier interpretability are not necessarily aligned.
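
The per-image optimization described in the abstract can be written compactly as follows. This is only a sketch in notation assumed here for illustration: $f_c$ is the classifier logit for class $c$, $A^k(x)$ are the $K$ feature maps of a chosen layer for input image $x$, $w \in \mathbb{R}^K$ are the per-image weights, $\operatorname{up}$ resizes the map to the image resolution, and $n$ rescales it to $[0, 1]$. The exact normalization and any constraint placed on the weights are assumptions; the abstract does not specify them.

```latex
\begin{align}
S(x; w) &= \sum_{k=1}^{K} w_k \, A^k(x), \\
w^\ast &= \arg\max_{w} \; f_c\bigl(x \odot n(\operatorname{up}(S(x; w)))\bigr), \\
S^\ast(x) &= S(x; w^\ast).
\end{align}
```

Under this reading, the optimization variable has only $K$ entries, one per feature map, rather than one per pixel as in masking methods that optimize directly in image space, and no additional network is trained.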

Paper Structure (39 sections, 11 equations, 11 figures, 11 tables)

Figures (11)

  • Figure 2: Saliency maps obtained by different methods for ImageNet (top two rows), Chest X-ray (row 3) and Kvasir (row 4) with VGG. Ground truth class shown on the left of the input image.
  • Figure A3: Failure examples of Opti-CAM regarding insertion/deletion.
  • Figure A4: Effect of selectivity (raising element-wise to exponent $\alpha$) of saliency maps on classification performance. $\operatorname{AD}$/$\operatorname{AI}$: average drop/increase (Chattopadhay et al., 2018); $\operatorname{AG}$: average gain (ours); $\downarrow$ / $\uparrow$: lower / higher is better.
  • Figure A5: Classification metrics vs. number of iterations for different learning rates, using VGG-16 on 1000 images of ImageNet. $\operatorname{AD}$/$\operatorname{AI}$: average drop/increase (Chattopadhay et al., 2018); $\operatorname{AG}$: average gain (ours); $\downarrow$ / $\uparrow$: lower / higher is better.
  • Figure A6: Sanity check of Opti-CAM on 1,000 images of the ImageNet validation set using ResNet50. Similarity between the saliency maps of the original and the randomized network, where layers are progressively replaced by random ones.
  • ...and 6 more figures
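
The classification metrics named in the captions above (AD, AI, AG) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function name `attribution_metrics` is hypothetical, `p` holds the predicted probability of the target class on each original image, `o` holds the same probability on the masked image, and AG is read here as the gain counterpart of AD, normalized by the available headroom `1 - p`. All probabilities are assumed to lie strictly between 0 and 1.

```python
def attribution_metrics(p, o):
    """Return (AD, AI, AG) in percent.

    p: target-class probabilities on the original images.
    o: target-class probabilities on the masked images.
    Assumes 0 < p_i < 1 for every image (hypothetical helper).
    """
    if len(p) != len(o) or len(p) == 0:
        raise ValueError("p and o must be non-empty and of equal length")
    n = len(p)
    pairs = list(zip(p, o))
    # Average drop: relative loss in confidence, counted only when it drops.
    ad = 100.0 * sum(max(0.0, pi - oi) / pi for pi, oi in pairs) / n
    # Average increase: share of images whose confidence goes up after masking.
    ai = 100.0 * sum(1 for pi, oi in pairs if oi > pi) / n
    # Average gain (assumed form): relative gain in confidence, counted only
    # when it rises, normalized by the remaining headroom 1 - p_i.
    ag = 100.0 * sum(max(0.0, oi - pi) / (1.0 - pi) for pi, oi in pairs) / n
    return ad, ai, ag


# Two images: confidence halves on the first and rises on the second.
ad, ai, ag = attribution_metrics([0.8, 0.5], [0.4, 0.75])
print(ad, ai, ag)
```

Lower AD and higher AI and AG are better, matching the arrows in the captions. AI only counts how often confidence rises, whereas AD and AG measure by how much it changes, which is why a gain-based quantity is a natural complement to AD.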