CA-Stream: Attention-based pooling for interpretable image recognition
Felipe Torres, Hanwei Zhang, Ronan Sicre, Stéphane Ayache, Yannis Avrithis
TL;DR
This work addresses the interpretability gap in vision models by linking CAM-based saliency with transformer-style attention. It introduces Cross-Attention Stream (CA-Stream), a parallel pooling mechanism that replaces GAP with a learnable, attention-driven aggregation yielding a global representation $\mathbf{q}_{L+1}$. The key contributions are (i) revealing that cross-attention pooling acts as a class-agnostic CAM, (ii) designing and integrating CA-Stream with CNN backbones to enhance post-hoc explanations, and (iii) demonstrating improved interpretability metrics on ImageNet while preserving recognition accuracy. The result is a practical pathway to more transparent vision systems by embedding explanation-compatible pooling directly into inference.
Abstract
Explanations obtained from transformer-based architectures in the form of raw attention can be seen as a class-agnostic saliency map. Additionally, attention-based pooling serves as a form of masking in feature space. Motivated by this observation, we design an attention-based pooling mechanism intended to replace Global Average Pooling (GAP) at inference. This mechanism, called Cross-Attention Stream (CA-Stream), comprises a stream of cross-attention blocks interacting with features at different network depths. CA-Stream enhances interpretability in models while preserving recognition performance.
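To make the pooling idea concrete, here is a minimal sketch of a single cross-attention pooling step next to GAP. It is an illustration under stated assumptions, not the CA-Stream implementation: it uses one query vector, plain scaled dot-product attention, and no learned projections or stacked blocks, and the names `cross_attention_pool`, `global_average_pool`, `features`, and `query` are hypothetical. The attention weights play the role of a class-agnostic saliency map over spatial positions, and the pooled vector is the features averaged under that soft mask.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_pool(features, query):
    """Pool p feature vectors (each of length d) with a single query vector.

    Returns (pooled, attention). `attention` holds one weight per spatial
    position and sums to 1, so it can be reshaped to the feature-map grid
    and read as a saliency map that does not depend on any class.
    """
    d = len(query)
    # Scaled dot-product similarity between the query and each position.
    scores = [
        sum(f_j * q_j for f_j, q_j in zip(f, query)) / math.sqrt(d)
        for f in features
    ]
    attention = softmax(scores)
    # Attention-weighted average of the features (soft masking in feature space).
    pooled = [sum(a * f[j] for a, f in zip(attention, features)) for j in range(d)]
    return pooled, attention


def global_average_pool(features):
    """GAP: every spatial position gets the same weight 1/p."""
    p = len(features)
    d = len(features[0])
    return [sum(f[j] for f in features) / p for j in range(d)]


# Toy 2x2 feature map flattened to p = 4 positions with d = 3 channels.
features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [2.0, 2.0, 0.0],
]
query = [1.0, 1.0, 0.0]  # assumed stand-in for a learned query (CLS-like) vector

pooled, attention = cross_attention_pool(features, query)
gap = global_average_pool(features)
```

In this sketch, a zero query gives uniform attention, so the pooled vector coincides with GAP. Any other query reweights positions, which is what makes the attention map usable as an explanation.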
