Edge Attention Module for Object Classification
Santanu Roy, Ashvath Suresh, Archit Gupta
TL;DR
This work introduces the Edge Attention Module (EAM), a novel spatial attention mechanism built on a Max-Min pooling operation that isolates edge information to aid object classification. By attaching EAM (and optionally a second EAM, 2EAM) in parallel to pre-trained CNN backbones, the framework emphasizes boundary features and accelerates convergence, achieving state-of-the-art results on Caltech-101/256 and strong gains on CIFAR-100 and Tiny ImageNet-200 versus recent attention and pooling-based models. Extensive experiments, including 5-fold cross-validation and Grad-CAM analysis, validate that EAM directs model focus toward object edges, improving accuracy and robustness across diverse architectures. The authors also outline a flexible design, trade-offs in complexity, and plans to extend edge-focused modules to Vision Transformers (ViT) for broader impact.
Abstract
A novel ``edge attention-based Convolutional Neural Network (CNN)'' is proposed in this research for the object classification task. With the advent of advanced computing technology, CNN models have achieved remarkable success, particularly in computer vision applications. Nevertheless, the efficacy of a conventional CNN is often hindered by class imbalance and inter-class similarity, both of which are prominent in computer vision. In this research, we introduce for the first time an ``Edge Attention Module (EAM)'' consisting of a Max-Min pooling layer followed by convolutional layers. Max-Min pooling is an entirely novel pooling technique, specifically designed to capture only the edge information that is crucial for any object classification task. By integrating this pooling technique into the attention module, the CNN inherently prioritizes essential edge features, thereby significantly boosting the accuracy and F1-score of the model. We have implemented the proposed EAM or 2EAMs on several standard pre-trained CNN models for the Caltech-101, Caltech-256, CIFAR-100 and Tiny ImageNet-200 datasets. Extensive experiments reveal that the proposed framework (that is, EAM with CNN and 2EAMs with CNN) outperforms all pre-trained CNN models, as well as recent models such as ``Pooling-based Vision Transformer (PiT)'', ``Convolutional Block Attention Module (CBAM)'', and ConvNeXt, by substantial margins. The proposed framework achieves accuracies of 95.5% and 86% on the Caltech-101 and Caltech-256 datasets, respectively. To the best of our knowledge, these are the best results reported on these datasets so far.
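
To illustrate why a Max-Min pooling operation can isolate edges, the following minimal sketch assumes that the operation returns, for each pooling window, the maximum value minus the minimum value. The abstract does not state the exact formula, so this definition, the function name `max_min_pool2d`, and the `kernel` and `stride` parameters are all illustrative assumptions rather than the authors' implementation. Plain Python lists are used to keep the example dependency-free; the convolutional layers that follow the pooling layer in EAM are not modelled here.

```python
def max_min_pool2d(image, kernel=2, stride=2):
    """Assumed Max-Min pooling: (max - min) over each kernel x kernel window.

    This is a hypothetical reading of the operation, not the paper's code.
    Flat regions give 0; windows that straddle an intensity change give
    a large response.
    """
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - kernel + 1, stride):
        out_row = []
        for c in range(0, cols - kernel + 1, stride):
            # Collect all pixel values covered by the current window.
            window = [image[r + i][c + j]
                      for i in range(kernel)
                      for j in range(kernel)]
            out_row.append(max(window) - min(window))
        pooled.append(out_row)
    return pooled


# Toy 4x4 image: dark left half, bright right half (one vertical edge).
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# Stride 1 so that some windows straddle the boundary between the halves.
edges = max_min_pool2d(image, kernel=2, stride=1)
for row in edges:
    print(row)
```

Under this assumed definition, only the windows that cover both the dark and the bright columns produce a non-zero response, while uniform regions are suppressed. This is consistent with the abstract's claim that the pooling layer captures only edge information.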
