SUNY: A Visual Interpretation Framework for Convolutional Neural Networks from a Necessary and Sufficient Perspective
Xiwei Xuan, Ziquan Deng, Hsuan-Tien Lin, Zhaodan Kong, Kwan-Liu Ma
TL;DR
This paper addresses the problem of explaining CNN decisions by embedding causal reasoning into visual explanations. The SUNY framework treats either input features or internal filters as hypothetical causes. It then applies bi-directional, Shapley-style quantifications from the necessity (N) and sufficiency (S) perspectives, so that explanations reflect both necessity ($E_N$) and sufficiency ($E_S$). Its two variants, SUNY-feature and SUNY-filter, generate 2D saliency maps that are more informative and robust than those produced by existing CAM-based and perturbation-based methods. Extensive experiments on ILSVRC2012 and CUB-200-2011 across multiple architectures show improved semantic fidelity, robustness to perturbations, and localization accuracy, and the method passes sanity checks. The approach offers a practical, interpretable lens on CNN decisions, with potential extensions to segmentation and vision-language tasks, and underscores the value of integrating causality into visual explanations.
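To illustrate why necessity and sufficiency are complementary, here is a minimal sketch on a toy 2x2 "image". It is not SUNY's actual quantification, which the summary describes as Shapley-style. Instead it uses simple one-at-a-time ablation, assuming a hypothetical scorer `toy_class_score` in place of a CNN and a baseline value of 0 for a "removed" feature. A cell's necessity is the score drop when only that cell is removed. Its sufficiency is the score recovered when only that cell is kept.

```python
from typing import Callable, List, Tuple

Grid = List[List[float]]


def toy_class_score(g: Grid) -> float:
    """Hypothetical stand-in for a CNN's target-class score.

    Top row: two redundant cues (either one suffices, via max).
    Bottom row: two cues that only count together (via min).
    """
    return max(g[0][0], g[0][1]) + min(g[1][0], g[1][1])


def necessity_sufficiency_maps(
    image: Grid, score: Callable[[Grid], float], baseline: float = 0.0
) -> Tuple[Grid, Grid]:
    """Per-cell necessity and sufficiency maps via one-at-a-time ablation."""
    h, w = len(image), len(image[0])
    full = score(image)                                # all features present
    empty = score([[baseline] * w for _ in range(h)])  # no features present
    nec = [[0.0] * w for _ in range(h)]
    suf = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Necessity: score drop when only cell (i, j) is set to the baseline.
            removed = [row[:] for row in image]
            removed[i][j] = baseline
            nec[i][j] = full - score(removed)
            # Sufficiency: score gained when cell (i, j) is the only one kept.
            kept = [[baseline] * w for _ in range(h)]
            kept[i][j] = image[i][j]
            suf[i][j] = score(kept) - empty
    return nec, suf


if __name__ == "__main__":
    img = [[1.0, 1.0], [1.0, 1.0]]
    nec, suf = necessity_sufficiency_maps(img, toy_class_score)
    print("necessity:  ", nec)  # [[0.0, 0.0], [1.0, 1.0]]
    print("sufficiency:", suf)  # [[1.0, 1.0], [0.0, 0.0]]
```

The two maps disagree. Each redundant top-row cue is sufficient but not necessary, and each bottom-row cue is necessary but not sufficient. An explanation built from only one perspective would therefore highlight only half of the evidence the toy model uses.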
Abstract
Researchers have proposed various methods for visually interpreting Convolutional Neural Networks (CNNs) via saliency maps, and Class Activation Map (CAM) based approaches are a leading family among them. However, in terms of their internal design logic, existing CAM-based approaches often overlook the causal perspective, which answers the core "why" question and helps humans understand the explanation. Additionally, current CNN explanations do not consider both necessity and sufficiency, two complementary sides of a desirable explanation. This paper presents a causality-driven framework, SUNY, designed to rationalize explanations for better human understanding. Using the CNN model's input features or internal filters as hypothetical causes, SUNY generates explanations through bi-directional quantifications from both the necessary and sufficient perspectives. Extensive evaluations demonstrate that SUNY not only produces more informative and convincing explanations from the angles of necessity and sufficiency, but also achieves performance competitive with other approaches across different CNN architectures on large-scale datasets, including ILSVRC2012 and CUB-200-2011.
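When internal filters serve as the hypothetical causes, the per-filter scores still have to be turned into a 2D map. The following minimal sketch uses the familiar CAM recipe of a weighted sum of activation maps followed by a ReLU. Several parts of it are assumptions for illustration and do not come from the paper. The classifier head `toy_head` and its weights are hypothetical. A filter is "removed" by zeroing its activation map. Each filter's weight is taken as the plain average of its necessity and sufficiency scores, which is not SUNY's stated combination rule.

```python
from typing import Callable, List

Map2D = List[List[float]]


def channel_mean(m: Map2D) -> float:
    """Global average pooling of one filter's activation map."""
    return sum(sum(row) for row in m) / (len(m) * len(m[0]))


def toy_head(acts: List[Map2D]) -> float:
    """Hypothetical classifier head: pooled activations times made-up weights."""
    weights = [2.0, 0.5, -1.0]
    return sum(w * channel_mean(a) for w, a in zip(weights, acts))


def filter_saliency(acts: List[Map2D], head: Callable[[List[Map2D]], float]) -> Map2D:
    """CAM-style 2D map whose per-filter weights come from filter ablation."""
    k = len(acts)
    h, w = len(acts[0]), len(acts[0][0])
    zero = [[0.0] * w for _ in range(h)]
    full, empty = head(acts), head([zero] * k)
    sal = [[0.0] * w for _ in range(h)]
    for c in range(k):
        # Necessity: score drop when filter c alone is zeroed out.
        nec = full - head([zero if i == c else a for i, a in enumerate(acts)])
        # Sufficiency: score obtained when filter c is the only active filter.
        suf = head([acts[c] if i == c else zero for i in range(k)]) - empty
        weight = 0.5 * (nec + suf)  # assumption: simple average of both views
        for y in range(h):
            for x in range(w):
                sal[y][x] += weight * acts[c][y][x]
    # Keep only positive evidence, as CAM-style methods commonly do (ReLU).
    return [[max(v, 0.0) for v in row] for row in sal]


if __name__ == "__main__":
    acts = [
        [[1.0, 0.0], [0.0, 0.0]],  # filter 0 fires top-left
        [[0.0, 1.0], [0.0, 0.0]],  # filter 1 fires top-right
        [[0.0, 0.0], [1.0, 1.0]],  # filter 2 fires on the bottom row
    ]
    print(filter_saliency(acts, toy_head))  # [[0.5, 0.125], [0.0, 0.0]]
```

With this linear toy head, each filter's necessity and sufficiency scores coincide. They diverge once the head is nonlinear, and in that case the choice of how to combine them starts to matter.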
