STEAM: Squeeze and Transform Enhanced Attention Module
Rishabh Sabharwal, Ram Samarth B B, Parikshit Singh Rathore, Punit Rathore
TL;DR
STEAM is a constant-parameter, graph-based dual-attention module that jointly models channel and spatial dependencies in CNNs. It decomposes attention into Channel Interaction Attention (CIA) and Spatial Interaction Attention (SIA), supports the latter with Output Guided Pooling (OGP), and builds both on multi-head graph transformers. On ImageNet-1K and MS COCO, STEAM improves accuracy over existing channel and spatial attention modules (e.g., SE, CBAM, ECA, GCT, MCA) while adding only a small parameter footprint (e.g., $8d$ parameters per STEAM unit; for $d=8$, about $320$ extra parameters in ResNet-50). Its GFLOPs overhead relative to the backbone is minimal, and in some comparisons it uses fewer GFLOPs than prior modules. The design is backbone-agnostic, so it can augment both deep and lightweight networks, and adaptive placement of STEAM units per stage contributes to the gains across architectures.
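The parameter figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `steam_unit_params` is hypothetical, and the implied number of STEAM units is inferred here from the two quoted figures ($8d$ per unit and about $320$ in total for $d=8$), not stated in the summary itself.

```python
def steam_unit_params(d: int) -> int:
    """Parameters per STEAM unit, using the 8d figure quoted in the summary.

    Hypothetical helper for illustration; not part of any released code.
    """
    return 8 * d


d = 8
per_unit = steam_unit_params(d)          # 8 * 8 = 64 parameters per unit
total_extra = 320                        # approximate total quoted for ResNet-50

# Inferred, not stated in the source: how many units the two figures imply.
implied_units = total_extra // per_unit  # 320 // 64 = 5

print(f"per unit: {per_unit}, implied units: {implied_units}")
```

Because the per-unit cost depends only on $d$ and not on the channel width of the stage, the total grows with the number of units placed rather than with the size of the backbone, which is what "constant-parameter" suggests here.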
Abstract
Channel and spatial attention mechanisms introduced by earlier works enhance the representation abilities of deep convolutional neural networks (CNNs) but often lead to increased parameter and computation costs. While recent approaches focus solely on efficient feature context modeling for channel attention, we aim to model both channel and spatial attention comprehensively with minimal parameters and reduced computation. Leveraging the principles of relational modeling in graphs, we introduce a constant-parameter module, STEAM: Squeeze and Transform Enhanced Attention Module, which integrates channel and spatial attention to enhance the representation power of CNNs. To our knowledge, we are the first to propose a graph-based approach for modeling both channel and spatial attention, utilizing concepts from multi-head graph transformers. Additionally, we introduce Output Guided Pooling (OGP), which efficiently captures spatial context to further enhance spatial attention. We extensively evaluate STEAM for large-scale image classification, object detection and instance segmentation on standard benchmark datasets. STEAM achieves a 2% increase in accuracy over the standard ResNet-50 model with only a meager increase in GFLOPs. Furthermore, STEAM outperforms leading modules ECA and GCT in terms of accuracy while achieving a three-fold reduction in GFLOPs.
