How to Squeeze An Explanation Out of Your Model

Tiago Roxo; Joana C. Costa; Pedro R. M. Inácio; Hugo Proença

TL;DR

Deep learning models often lack interpretability, especially in non-image settings. The paper introduces a model-agnostic approach that inserts a Squeeze-and-Excitation (SE) block before the classifier to derive heatmaps from channel-wise SE weights, enabling visual explanations across image and video/multi-modal data without sacrificing accuracy. Results show SE-based interpretability yields reliable, competitive heatmaps compared with GradCAM variants on standard object datasets and extends to biometric contexts like CelebA facial attributes and Active Speaker Detection. The method is simple to integrate across architectures and data modalities, offering practical interpretability with low overhead in sensitive applications such as biometrics and behavioral analysis.

Abstract

Deep learning models are widely used for their reliability across a variety of tasks. However, they do not typically provide the reasoning behind their decisions, which is a significant drawback, particularly in sensitive areas such as biometrics, security and healthcare. The most commonly used interpretability approaches create visual attention heatmaps of regions of interest on an image, based on backpropagation of the model's gradients. Although this is a viable approach, current methods target image settings and standard deep learning models, meaning that they require significant adaptations to work in video/multi-modal settings and with custom architectures. This paper proposes a model-agnostic approach to interpretability, based on a novel use of the Squeeze-and-Excitation (SE) block to create visual attention heatmaps. By including an SE block prior to the classification layer of any model, we are able to retrieve the most influential features by manipulating the SE vector, one of the key components of the SE block. Our results show that this new SE-based interpretability can be applied to various models in image and video/multi-modal settings, namely facial-feature biometrics with CelebA and behavioral biometrics with Active Speaker Detection datasets. Furthermore, our proposal does not compromise model performance on the original task and is competitive with current interpretability approaches on state-of-the-art object datasets, highlighting its robustness on varied data beyond the biometric context.
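The idea of reading an explanation off the SE vector can be illustrated with a minimal sketch in plain Python. The abstract does not spell out how the SE vector is manipulated, so the channel ranking and weighted sum used here are assumptions made for illustration, not the authors' exact procedure. The function names (`squeeze`, `excite`, `se_heatmap`), the toy feature map, and the fixed weights `w1` and `w2` are all hypothetical.

```python
import math

def squeeze(features):
    """Global average pool: one scalar per channel of a C x H x W feature map."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]

def excite(z, w1, w2):
    """Two fully connected layers (ReLU, then sigmoid): one weight in (0, 1) per channel."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, z))) for row in w1]
    return [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden)))) for row in w2]

def se_heatmap(features, se_weights, top_k):
    """Assumed heatmap rule: weighted sum of the top_k channels by SE weight,
    min-max normalised to [0, 1]."""
    ranked = sorted(range(len(se_weights)), key=lambda c: se_weights[c], reverse=True)[:top_k]
    height, width = len(features[0]), len(features[0][0])
    heat = [[sum(se_weights[c] * features[c][i][j] for c in ranked)
             for j in range(width)] for i in range(height)]
    lo = min(min(row) for row in heat)
    hi = max(max(row) for row in heat)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat map
    return [[(v - lo) / span for v in row] for row in heat], ranked

# Toy 3-channel, 2x2 feature map standing in for the last layer before the classifier.
features = [
    [[0.0, 0.0], [0.0, 4.0]],  # channel 0: active in the bottom-right corner
    [[2.0, 0.0], [0.0, 0.0]],  # channel 1: active in the top-left corner
    [[1.0, 1.0], [1.0, 1.0]],  # channel 2: uniform, carries no spatial information
]
w1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # hypothetical reduction layer (3 -> 2)
w2 = [[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # hypothetical expansion layer (2 -> 3)

se_vector = excite(squeeze(features), w1, w2)
heatmap, top_channels = se_heatmap(features, se_vector, top_k=1)
```

In a trained model the excitation weights would be learned jointly with the classifier, and the SE vector would also rescale the features passed to the classification layer. Here the weights are fixed by hand so that channel 0 receives the largest SE weight and the heatmap highlights the bottom-right position.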
