CNN Explainability with Multivector Tucker Saliency Maps for Self-Supervised Models

Aymene Mohammed Bouayed, Samuel Deslauriers-Gauthier, Adrian Iaccovelli, David Naccache

TL;DR

This paper introduces the Tucker Saliency Map (TSM) method, which applies Tucker tensor decomposition to better capture the inherent structure of feature maps, producing more accurate singular vectors and values.

Abstract

Interpreting the decisions of Convolutional Neural Networks (CNNs) is essential for understanding their behavior, yet explainability remains a significant challenge, particularly for self-supervised models. Most existing methods for generating saliency maps rely on ground-truth labels, restricting their use to supervised tasks. EigenCAM is the only notable label-independent alternative, leveraging Singular Value Decomposition to generate saliency maps applicable across CNN models, but it does not fully exploit the tensorial structure of feature maps. In this work, we introduce the Tucker Saliency Map (TSM) method, which applies Tucker tensor decomposition to better capture the inherent structure of feature maps, producing more accurate singular vectors and values. These are used to generate high-fidelity saliency maps, effectively highlighting objects of interest in the input. We further extend EigenCAM and TSM into multivector variants, Multivec-EigenCAM and Multivector Tucker Saliency Maps (MTSM), which utilize all singular vectors and values, further improving saliency map quality. Quantitative evaluations on supervised classification models demonstrate that TSM, Multivec-EigenCAM, and MTSM achieve performance competitive with label-dependent methods. Moreover, TSM enhances explainability by approximately 50% over EigenCAM for both supervised and self-supervised models. Multivec-EigenCAM and MTSM further advance state-of-the-art explainability performance on self-supervised models, with MTSM achieving the best results.
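
To make the label-free baseline concrete, here is a minimal NumPy sketch of an SVD-based saliency map in the spirit of EigenCAM. It is an illustration under stated assumptions, not the authors' implementation: the function name `eigencam_saliency`, the `(C, H, W)` layout, and the absence of mean-centering are assumptions, and the absolute value and rescaling steps are borrowed from the post-processing described in the Figure 1 caption.

```python
import numpy as np


def eigencam_saliency(feature_maps: np.ndarray) -> np.ndarray:
    """Label-free saliency map from one layer's activations.

    feature_maps: array of shape (C, H, W); this layout is an assumption.
    Returns an (H, W) map rescaled to [0, 1].
    """
    c, h, w = feature_maps.shape
    # Flatten so that rows are spatial positions and columns are channels.
    flat = feature_maps.reshape(c, h * w).T  # shape (H*W, C)
    # Economy-size SVD; the rows of vt are the right singular vectors.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    # Project every spatial position onto the first right singular vector.
    projection = flat @ vt[0]  # shape (H*W,)
    # The sign of a singular vector is arbitrary, so take magnitudes
    # (assumed post-processing, see the lead-in).
    saliency = np.abs(projection).reshape(h, w)
    # Min-max rescale to [0, 1]; a constant map becomes all zeros.
    span = saliency.max() - saliency.min()
    if span == 0:
        return np.zeros_like(saliency)
    return (saliency - saliency.min()) / span
```

No labels or gradients appear anywhere in the computation, which is why this family of methods applies to self-supervised models. Flattening the two spatial axes into one is also the step that discards the tensor structure the Tucker-based methods aim to keep.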

Paper Structure

This paper contains 29 sections, 6 equations, 20 figures, 4 tables.

Figures (20)

  • Figure 1: Proposed TSM and MTSM methods. First, the feature map tensor $\mathcal{F}$ is extracted from the output of a convolutional layer (preferably the last convolutional layer, as it has the widest field of view). Then, the Tucker tensor decomposition (HOOI) is performed on $\mathcal{F}$. This decomposition yields a core tensor $\mathcal{C}$ and three matrices of orthonormal singular vectors, one per dimension: $A^{(1)}$, $A^{(2)}$, and $A^{(3)}$. The matrix $A^{(1)}$ and the singular values are then used to compute the TSM (see Equation \ref{tuckercam_eq}) and the MTSM (see Equation \ref{multivec_eq}). Finally, the absolute value is applied element-wise to the resulting matrix, and the values are rescaled to the range $[0, 1]$.
  • Figure 2: Boxplot comparing the distribution of the first singular value divided by the sum of all singular values per tensor, for both SVD and Tucker decomposition. These singular values are obtained from the decomposition of all the feature map tensors produced by five different self-supervised models on the Pascal VOC validation dataset.
  • Figure 3: Boxplot representing the distribution of the first five singular values, each divided by the sum of all singular values per tensor, for both SVD and Tucker decomposition. These singular values are obtained from the decomposition of all the feature map tensors produced by the MoCo V2 and VicRegL ConvNeXt models on the Pascal VOC validation dataset.
  • Figure 4: Qualitative comparison of saliency map-based methods on the ResNet50 model.
  • Figure 5: Qualitative comparison of saliency maps extracted using EigenCAM versus TSM on the VicRegL ResNet50 model applied to the ImageNet dataset.
  • ...and 15 more figures
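
The steps named in the Figure 1 and Figure 2 captions can be sketched in NumPy as follows. This is an illustration under loud assumptions. It uses a plain higher-order SVD (HOSVD), which is simpler than an iterative HOOI solver and is swapped in here for brevity. The TSM and MTSM equations are not reproduced on this page, so the step that combines $A^{(1)}$ with the singular values is deliberately left out. All function names and the `(C, H, W)` layout are hypothetical.

```python
import numpy as np


def unfold(tensor: np.ndarray, mode: int) -> np.ndarray:
    """Mode-n unfolding: bring axis `mode` to the front, flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor: np.ndarray, matrix: np.ndarray, mode: int) -> np.ndarray:
    """Multiply `tensor` by `matrix` (shape (J, I_mode)) along axis `mode`."""
    moved = np.moveaxis(tensor, mode, -1)
    return np.moveaxis(moved @ matrix.T, -1, mode)


def hosvd(tensor: np.ndarray):
    """Tucker decomposition via HOSVD (a stand-in for HOOI, by assumption).

    Returns the core tensor and one orthonormal factor matrix per axis.
    """
    factors = []
    for mode in range(tensor.ndim):
        # Left singular vectors of each unfolding form that axis's factor.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u)
    core = tensor
    for mode, u in enumerate(factors):
        # Project each axis onto its factor basis to obtain the core.
        core = mode_product(core, u.T, mode)
    return core, factors


def mode_singular_values(core: np.ndarray, mode: int) -> np.ndarray:
    """Norms of the core's slices along `mode` (the mode-n singular values)."""
    return np.linalg.norm(unfold(core, mode), axis=1)


def leading_ratio(singular_values: np.ndarray) -> float:
    """First singular value divided by the sum of all, as in Figure 2."""
    return float(singular_values[0] / singular_values.sum())


def to_unit_range(matrix: np.ndarray) -> np.ndarray:
    """Final step from Figure 1: element-wise absolute value, then [0, 1]."""
    magnitude = np.abs(matrix)
    span = magnitude.max() - magnitude.min()
    if span == 0:
        return np.zeros_like(magnitude)
    return (magnitude - magnitude.min()) / span


# Usage on a random stand-in for a feature map tensor (assumed C, H, W).
rng = np.random.default_rng(0)
features = rng.standard_normal((6, 4, 5))
core, (a1, a2, a3) = hosvd(features)
ratio = leading_ratio(mode_singular_values(core, 0))
```

With full-rank factors, multiplying the core back by `a1`, `a2`, and `a3` along the three axes recovers `features` exactly. Which singular values the paper uses for the Tucker side of Figure 2 is not stated on this page; using the first axis here is an assumption.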