Finer-CAM: Spotting the Difference Reveals Finer Details for Visual Explanation

Ziheng Zhang, Jianyang Gu, Arpita Chowdhury, Zheda Mai, David Carlyn, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao

TL;DR

Finer-CAM tackles fine-grained visual explanation by shifting from explaining a single target class in isolation to highlighting the differences between the target class and visually similar classes. It computes activation weights from the logit difference $y^c - \gamma y^d$, where $y^c$ is the logit of the target class and $y^d$ is the logit of a visually similar reference class (optionally aggregating across multiple references). This suppresses features shared with similar classes and emphasizes discriminative cues. The approach remains compatible with existing CAM methods, supports multi-modal zero-shot scenarios, and offers a tunable comparison strength $\gamma$ to balance coarse contours with fine details. Results on five fine-grained datasets show a larger relative confidence drop and better localization than strong CAM baselines. The method is a practical, efficient tool for more precise visual explanations, with potential applications in verification and attribute-based localization.
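
As a sketch of what the logit difference implies when Grad-CAM is the base method, the channel weights can be written as follows. The symbols $A_k$ (the $k$-th feature map of spatial size $H \times W$), $\alpha_k$ (its channel weight), and $L^{c,d}$ (the resulting map) are notation assumed here for illustration. The second equality follows only from the linearity of the gradient.

```latex
\alpha_k^{c} = \frac{1}{HW}\sum_{i,j}\frac{\partial y^{c}}{\partial A_k^{ij}},
\qquad
\alpha_k^{c,d}
  = \frac{1}{HW}\sum_{i,j}\frac{\partial\,(y^{c}-\gamma y^{d})}{\partial A_k^{ij}}
  = \alpha_k^{c} - \gamma\,\alpha_k^{d},
\qquad
L^{c,d} = \mathrm{ReLU}\Big(\sum_{k}\alpha_k^{c,d}\,A_k\Big).
```

Under this reading, $\gamma = 0$ recovers the standard Grad-CAM weights for class $c$, while larger $\gamma$ subtracts more of the contribution that channel $k$ also makes to the similar class $d$.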

Abstract

Class activation map (CAM) has been widely used to highlight image regions that contribute to class predictions. Despite its simplicity and computational efficiency, CAM often struggles to identify discriminative regions that distinguish visually similar fine-grained classes. Prior efforts address this limitation by introducing more sophisticated explanation processes, but at the cost of extra complexity. In this paper, we propose Finer-CAM, a method that retains CAM's efficiency while achieving precise localization of discriminative regions. Our key insight is that the deficiency of CAM lies not in "how" it explains, but in "what" it explains. Specifically, previous methods attempt to identify all cues contributing to the target class's logit value, which inadvertently also activates regions predictive of visually similar classes. By explicitly comparing the target class with similar classes and spotting their differences, Finer-CAM suppresses features shared with other classes and emphasizes the unique, discriminative details of the target class. Finer-CAM is easy to implement, compatible with various CAM methods, and can be extended to multi-modal models for accurate localization of specific concepts. Additionally, Finer-CAM allows adjustable comparison strength, enabling users to selectively highlight coarse object contours or fine discriminative details. Quantitatively, we show that masking out the top 5% of activated pixels by Finer-CAM results in a larger relative confidence drop compared to baselines. The source code and demo are available at https://github.com/Imageomics/Finer-CAM.
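
The comparison idea in the abstract can be illustrated on a toy model. The following is a minimal sketch, not the authors' implementation: it assumes a head made of global average pooling followed by a linear classifier (so gradients are available in closed form), uses Grad-CAM-style channel weights, and averages the maps over reference classes. The function names `channel_weights` and `finer_cam`, the toy data, and the choice of averaging with a per-reference ReLU are all assumptions made for illustration.

```python
import numpy as np


def channel_weights(head, feats, cls):
    """Grad-CAM channel weights for an assumed GAP + linear head.

    The logit is y[cls] = sum_k head[cls, k] * mean_ij feats[k, i, j], so its
    gradient at every spatial position of channel k is head[cls, k] / (H * W),
    and the spatial average of that gradient is the same value.
    """
    _, h, w = feats.shape
    return head[cls] / (h * w)


def finer_cam(head, feats, target, references, gamma=1.0):
    """Saliency map for the logit difference y[target] - gamma * y[ref].

    Maps for several reference classes are averaged (an assumed aggregation).
    """
    alpha_target = channel_weights(head, feats, target)
    maps = []
    for ref in references:
        # Gradients are linear, so the weights of the difference are the
        # difference of the weights.
        alpha = alpha_target - gamma * channel_weights(head, feats, ref)
        # Weighted sum over channels, then ReLU to keep positive evidence.
        maps.append(np.maximum(np.tensordot(alpha, feats, axes=1), 0.0))
    return np.mean(maps, axis=0)


# Toy feature maps (2 channels, 2x2 spatial grid):
# channel 0 fires on a region shared by both classes (top row),
# channel 1 fires on a detail unique to the target class (bottom-right).
feats = np.array([
    [[1.0, 1.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
])
# Classifier weights: class 0 (target) uses both channels,
# class 1 (similar class) uses only the shared channel.
head = np.array([
    [1.0, 1.0],
    [1.0, 0.0],
])

baseline = finer_cam(head, feats, target=0, references=[1], gamma=0.0)
finer = finer_cam(head, feats, target=0, references=[1], gamma=1.0)
print(baseline)  # shared region and unique detail are both highlighted
print(finer)     # only the unique detail remains
```

With `gamma=0.0` the function reduces to the plain Grad-CAM map for the target class. With `gamma=1.0` the shared channel cancels out and only the discriminative location stays active, which is the behaviour the abstract describes as suppressing shared features.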

Paper Structure

This paper contains 29 sections, 12 equations, 13 figures, and 7 tables.

Figures (13)

  • Figure 1: Illustration of Finer-CAM. Left: Sorted cosine similarity between linear classifier weights, averaged across all classes (details in the supplementary). Many pairs of classes are highly similar, yet neural networks can effectively distinguish them to achieve high fine-grained classification accuracy. Middle: Standard CAM methods highlight main regions contributing to the target class's logit value, inadvertently including regions predictive of similar classes and overshadowing fine discriminative details. Right: We propose Finer-CAM to explicitly compare the target class with similar classes and spot the difference, enabling accurate localization of discriminative details.
  • Figure 2: Finer-CAM can be extended to multi-modal zero-shot models to accurately highlight or mask out specific concepts.
  • Figure 3: The pipeline of the proposed Finer-CAM method, with Grad-CAM as the baseline. An image is first passed through the encoder blocks and the subsequent linear classifier to acquire feature maps at the desired network layer and the prediction logits, respectively. Different from standard Grad-CAM, we calculate the gradients of the logit difference between the target class and a visually similar class. In this way, the produced CAM effectively captures and highlights subtle differences between these two classes.
  • Figure 4: The visualization comparison between the proposed Finer-CAM and baseline CAM methods. For each group, we show the target image, one example image from the most similar class, baseline CAM, and Finer-CAM's results. Finer-CAM localizes and emphasizes the discriminative details, and also suppresses some noise in the baseline CAMs.
  • Figure 5: The saliency maps by Grad-CAM and Finer-CAM with deletion curves. In each group, the top-left is the target image, while the bottom-left is an example image from the most similar class. In addition to the prediction confidence of the target class, we also show the curve of the second predicted class.
  • ...and 8 more figures
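
The Figure 1 caption mentions sorting cosine similarities between linear classifier weights. One plausible way to use that measurement for choosing reference classes is sketched below. Whether the authors select references this way is an assumption here; the function name `most_similar_classes` and the toy weight matrix are hypothetical.

```python
import numpy as np


def most_similar_classes(head, target, k=3):
    """Rank classes by cosine similarity of their classifier weight vectors.

    `head` is assumed to be a (num_classes, dim) matrix with no all-zero rows.
    Returns the indices of the k most similar classes and their similarities.
    """
    head = np.asarray(head, dtype=float)
    unit = head / np.linalg.norm(head, axis=1, keepdims=True)
    sims = unit @ unit[target]
    sims[target] = -np.inf  # never pick the target as its own reference
    order = np.argsort(-sims)[:k]
    return order, sims[order]


# Toy classifier with four classes in a 2-D feature space.
toy_head = [
    [1.0, 0.0],   # class 0 (target)
    [0.9, 0.1],   # class 1: nearly parallel to class 0
    [0.0, 1.0],   # class 2: orthogonal
    [-1.0, 0.0],  # class 3: opposite direction
]
indices, similarities = most_similar_classes(toy_head, target=0, k=2)
print(indices, similarities)
```

The returned indices could then serve as the reference classes in a comparison-based CAM, with the highest-ranked class playing the role of the "most similar class" shown in Figures 4 and 5.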