SUNY: A Visual Interpretation Framework for Convolutional Neural Networks from a Necessary and Sufficient Perspective

Xiwei Xuan, Ziquan Deng, Hsuan-Tien Lin, Zhaodan Kong, Kwan-Liu Ma

TL;DR

This paper addresses the problem of explaining CNN decisions by embedding causal reasoning into visual explanations. The SUNY framework treats either input features or internal filters as hypothetical causes and uses bi-directional N-S Shapley-style quantifications to produce explanations that reflect both necessity ($E_N$) and sufficiency ($E_S$). SUNY-feature and SUNY-filter generate 2D saliency maps, enabling more informative and robust interpretations than existing CAM/perturbation-based methods. Extensive experiments on ILSVRC2012 and CUB-200-2011 across multiple architectures show improved semantic fidelity, robustness to perturbations, and localization accuracy, while passing sanity checks. The approach offers a practical, interpretable lens on CNN decisions with potential extensions to segmentation and vision-language tasks, underscoring the value of integrating causality into visual explanations.

Abstract

Researchers have proposed various methods for visually interpreting the Convolutional Neural Network (CNN) via saliency maps, which include Class-Activation-Map (CAM) based approaches as a leading family. However, in terms of the internal design logic, existing CAM-based approaches often overlook the causal perspective that answers the core "why" question to help humans understand the explanation. Additionally, current CNN explanations lack the consideration of both necessity and sufficiency, two complementary sides of a desirable explanation. This paper presents a causality-driven framework, SUNY, designed to rationalize the explanations toward better human understanding. Using the CNN model's input features or internal filters as hypothetical causes, SUNY generates explanations by bi-directional quantifications on both the necessary and sufficient perspectives. Extensive evaluations justify that SUNY not only produces more informative and convincing explanations from the angles of necessity and sufficiency, but also achieves performances competitive to other approaches across different CNN architectures over large-scale datasets, including ILSVRC2012 and CUB-200-2011.

Paper Structure (14 sections, 4 equations, 6 figures, 3 tables, 1 algorithm)

Figures (6)

  • Figure 1: An example of SUNY explanations. SUNY highlights sufficient and necessary input regions w.r.t. the model's prediction towards the target class. The 2D saliency map is the first-of-its-kind visual explanation design to the best of our knowledge.
  • Figure 2: Overview of SUNY framework. Phase (a) is a forward pass of input image $I$ through a CNN model, where the prediction probability of the target class is $p_{c}(F)$. Phases (b)-(e) present the generation of SUNY-feature (green) and SUNY-filter (blue), respectively, referring to different types of hypothesized causes. Note that they are not simultaneous processes. In Phase (b), we obtain filters and feature maps of a specified layer, and intervene on model filters or the corresponding input features. We get new prediction probabilities after the intervention and calculate N-S Effect, $E_N$, $E_S$ in Phase (c), which are fed back to Phase (b) to construct hypothesized cause sets $F_{hypN}$ and $F_{hypS}$. Through intervening on $F_{hypN}$ and $F_{hypS}$ (Phase (b)), we can obtain $E_N$, $E_S$ (Phase (c)) and N-S Responsibilities $R_N$ and $R_S$ (Phase (d)), which are weights for the linear combination of feature maps. The saliency maps are generated in Phase (e), where we show SUNY-feature results as an example. Implementation details are included in Sec. \ref{sec:solution}.
  • Figure 3: Visual comparison of saliency maps from different methods. The first row: a VGG16 trained on CUB-200-2011, and the image is correctly predicted as Gull. The second row: a VGG16 trained on ILSVRC2012, and the image is correctly predicted as Dog-sled.
  • Figure 4: Semantic evaluation of SUNY explanations for a VGG16 trained on CUB for bird species classification. The bird images in the first row are from four bird species belonging to two families, and the correct/incorrect predictions are marked by a check mark and a cross, respectively. For the two images marked by a cross, the model mistakes the actual species for the other species in the same family. Each column corresponds to one image; the second and third rows show the sufficiency and necessity heatmaps. The small image in the bottom corner of each heatmap presents the highlighted image portion.
  • Figure 5: Comparison of SUNY with seven visual explanation methods in terms of the N-S Quantification metric.
  • ...and 1 more figures
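
The Figure 2 caption describes scoring filters by intervention and then using those scores as weights in a linear combination of feature maps. The sketch below illustrates that general idea only, under heavy simplification: it ablates one filter at a time on a toy sigmoid classification head, whereas the paper uses Shapley-style quantification over hypothesized cause sets. The names `predict`, `ns_effects`, and `saliency`, the toy head, and the example values are all assumptions for illustration and do not come from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(feature_maps, head_weights, keep):
    """Toy class probability (assumed head): sigmoid over the weighted,
    globally averaged feature maps of the filters that are kept."""
    logit = 0.0
    for fmap, weight, kept in zip(feature_maps, head_weights, keep):
        if kept:
            cells = [v for row in fmap for v in row]
            logit += weight * sum(cells) / len(cells)
    return sigmoid(logit)


def ns_effects(feature_maps, head_weights):
    """Simplified single-filter interventions (not the paper's exact method).

    Necessity-style effect: drop in probability when only filter i is removed.
    Sufficiency-style effect: gain in probability when only filter i is kept.
    """
    n = len(feature_maps)
    p_full = predict(feature_maps, head_weights, [True] * n)
    p_empty = predict(feature_maps, head_weights, [False] * n)
    e_n, e_s = [], []
    for i in range(n):
        without_i = [j != i for j in range(n)]
        only_i = [j == i for j in range(n)]
        e_n.append(p_full - predict(feature_maps, head_weights, without_i))
        e_s.append(predict(feature_maps, head_weights, only_i) - p_empty)
    return e_n, e_s


def saliency(feature_maps, weights):
    """CAM-style map: ReLU of the weighted sum of feature maps."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    out = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, weights):
        for r in range(height):
            for c in range(width):
                out[r][c] += weight * fmap[r][c]
    return [[max(0.0, v) for v in row] for row in out]


# Hypothetical 2x2 feature maps for two filters.
maps = [
    [[1.0, 0.0], [0.0, 0.0]],  # filter 0 responds in the top-left cell
    [[0.0, 0.0], [0.0, 1.0]],  # filter 1 responds in the bottom-right cell
]
head = [4.0, 0.0]  # the toy head ignores filter 1 entirely

e_n, e_s = ns_effects(maps, head)
sal_n = saliency(maps, e_n)  # necessity-weighted map
sal_s = saliency(maps, e_s)  # sufficiency-weighted map
```

In this toy setup only filter 0 influences the prediction, so only the top-left cell is highlighted in both maps. With an additive head like this one the two effects coincide; they can differ once filters interact, which is the case the paper's two-sided view targets.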