How to Squeeze An Explanation Out of Your Model

Tiago Roxo, Joana C. Costa, Pedro R. M. Inácio, Hugo Proença

TL;DR

Deep learning models often lack interpretability, especially in non-image settings. The paper introduces a model-agnostic approach that inserts a Squeeze-and-Excitation (SE) block before the classifier to derive heatmaps from channel-wise SE weights, enabling visual explanations across image and video/multi-modal data without sacrificing accuracy. Results show SE-based interpretability yields reliable, competitive heatmaps compared with GradCAM variants on standard object datasets and extends to biometric contexts like CelebA facial attributes and Active Speaker Detection. The method is simple to integrate across architectures and data modalities, offering practical interpretability with low overhead in sensitive applications such as biometrics and behavioral analysis.

Abstract

Deep learning models are widely used nowadays for their reliability in performing various tasks. However, they do not typically provide the reasoning behind their decisions, which is a significant drawback, particularly in more sensitive areas such as biometrics, security, and healthcare. The most commonly used interpretability approaches create visual attention heatmaps of regions of interest on an image, based on the model's gradient backpropagation. Although this is a viable approach, current methods are targeted toward image settings and default/standard deep learning models, meaning that they require significant adaptations to work on video/multi-modal settings and custom architectures. This paper proposes a model-agnostic approach for interpretability, based on a novel use of the Squeeze-and-Excitation (SE) block, that creates visual attention heatmaps. By including an SE block prior to the classification layer of any model, we are able to retrieve the most influential features by manipulating the SE vector, one of the key components of the SE block. Our results show that this new SE-based interpretability can be applied to various models in image and video/multi-modal settings, namely biometrics of facial features with CelebA and behavioral biometrics using Active Speaker Detection datasets. Furthermore, our proposal does not compromise model performance on the original task, and has competitive results with current interpretability approaches on state-of-the-art object datasets, highlighting its robustness on varied data beyond the biometric context.
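
To make the role of the SE vector concrete, the following is a minimal NumPy sketch of a standard SE block (squeeze by global average pooling, excitation by two small fully connected layers, then channel-wise rescaling). It is not the authors' implementation: the function name `se_block`, the random weights, the feature-map shape, and the reduction ratio are all assumptions chosen for illustration.

```python
import numpy as np


def se_block(u, w1, w2):
    """Apply a Squeeze-and-Excitation block to one feature map.

    u  : feature map of shape (C, H, W)
    w1 : first fully connected weights, shape (C // r, C)
    w2 : second fully connected weights, shape (C, C // r)
    Returns the rescaled feature map and the SE vector s of shape (C,).
    """
    # Squeeze: average each channel over its spatial dimensions H x W.
    z = u.mean(axis=(1, 2))
    # Excitation: bottleneck layer with ReLU, then sigmoid gate per channel.
    hidden = np.maximum(w1 @ z, 0.0)
    s = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))
    # Scale: multiply every channel by its importance weight.
    scaled = u * s[:, None, None]
    return scaled, s


# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
channels, height, width, reduction = 32, 7, 7, 4
u = rng.standard_normal((channels, height, width))
w1 = rng.standard_normal((channels // reduction, channels)) * 0.1
w2 = rng.standard_normal((channels, channels // reduction)) * 0.1

scaled, s = se_block(u, w1, w2)
print(scaled.shape, s.shape)
```

In the setting described in the abstract, `u` would be the output of the feature extractor and `scaled` would feed the classification layer, while `s` is the vector inspected for interpretability.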

Paper Structure

This paper contains 15 sections, 3 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Overview of the inclusion of SE blocks for model-agnostic interpretability. Two different settings are depicted (multi-modal on the left, image on the right), where an SE block is similarly included in the feature extractors to output attention heatmaps of the model's predictions. Heatmaps are created from the channel features corresponding to the top 10% of SE vector values, via channel interpolation and combination with the original image. The audio encoding of the multi-modal setting (ASD) is not displayed for simplicity.
  • Figure 2: Overview of a Squeeze-and-Excitation block, where the input U is squeezed per channel ($u_c$, in (1)) via average pooling over the spatial dimensions $H\times W$ to output a vector of importance weights ($\mathbf{s}$, in (2)). This vector is used to highlight the most important features via channel-wise multiplication of each weight with the respective channel of the input feature map. Retrieved from the original paper (Hu et al., 2018).
  • Figure 3: SE value distribution of all considered models for the datasets CIFAR-10, CIFAR-100, CelebA, AVA, and WASD (from left to right). SE values are normalized to zero mean.
  • Figure 4: Inclusion of SE block in ResNet50 (left), and overview of how standard ASD models perform (right).
  • Figure 5: SE Interpretability (last column per original image) relative to other standard visual interpretability approaches: GradCAM, GradCAM++, EigenGrad, and FullGrad. For each original image, the top row shows the interpretability of ResNet50 and the bottom row that of InceptionV3.
  • ...and 4 more figures
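
The heatmap construction summarized in the Figure 1 caption (select the channels with the top 10% of SE values, interpolate them to image size, and combine them with the original image) can be sketched as below. This is an illustrative sketch under stated assumptions, not the paper's code: the helper names `se_heatmap` and `overlay` are hypothetical, nearest-neighbour upsampling stands in for whichever interpolation the authors use, and averaging the selected channels and alpha-blending the result are assumed combination rules.

```python
import numpy as np


def se_heatmap(features, se_vector, image_hw, top_fraction=0.10):
    """Build a [0, 1] heatmap from the channels with the largest SE values.

    features  : feature map of shape (C, h, w)
    se_vector : SE weights of shape (C,)
    image_hw  : (height, width) of the original image
    """
    num_channels = features.shape[0]
    k = max(1, int(round(top_fraction * num_channels)))
    # Indices of the k channels with the highest SE values.
    top_idx = np.argsort(se_vector)[-k:]
    # Assumed combination rule: average the selected channels.
    combined = features[top_idx].mean(axis=0)
    # Nearest-neighbour upsampling to the image size (stand-in interpolation).
    out_h, out_w = image_hw
    rows = np.arange(out_h) * combined.shape[0] // out_h
    cols = np.arange(out_w) * combined.shape[1] // out_w
    upsampled = combined[rows[:, None], cols[None, :]]
    # Min-max normalize so the heatmap lies in [0, 1].
    span = upsampled.max() - upsampled.min()
    if span == 0:
        return np.zeros_like(upsampled), top_idx
    return (upsampled - upsampled.min()) / span, top_idx


def overlay(image, heat, alpha=0.5):
    """Alpha-blend a heatmap onto an RGB image with values in [0, 1]."""
    return (1.0 - alpha) * image + alpha * heat[..., None]


# Hypothetical inputs, for illustration only.
rng = np.random.default_rng(0)
features = rng.random((20, 4, 4))
se_vector = rng.random(20)
image = rng.random((32, 32, 3))

heat, top_idx = se_heatmap(features, se_vector, image.shape[:2])
blended = overlay(image, heat)
print(heat.shape, sorted(top_idx.tolist()))
```

Unlike gradient-based methods such as GradCAM, this sketch needs only a forward pass: the channel ranking comes directly from the SE vector rather than from backpropagated gradients.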