Edge Attention Module for Object Classification

Santanu Roy, Ashvath Suresh, Archit Gupta

TL;DR

This work introduces the Edge Attention Module (EAM), a novel spatial attention mechanism built on a Max-Min pooling operation that isolates edge information to aid object classification. By attaching EAM (and optionally a second EAM, 2EAM) in parallel to pre-trained CNN backbones, the framework emphasizes boundary features and accelerates convergence, achieving state-of-the-art results on Caltech-101/256 and strong gains on CIFAR-100 and Tiny ImageNet-200 versus recent attention and pooling-based models. Extensive experiments, including 5-fold cross-validation and Grad-CAM analysis, validate that EAM directs model focus toward object edges, improving accuracy and robustness across diverse architectures. The authors also outline a flexible design, trade-offs in complexity, and plans to extend edge-focused modules to Vision Transformers (ViT) for broader impact.

Abstract

This research proposes a novel "edge attention-based Convolutional Neural Network (CNN)" for the object classification task. With the advent of advanced computing technology, CNN models have achieved remarkable success, particularly in computer vision applications. Nevertheless, the efficacy of conventional CNNs is often hindered by class imbalance and inter-class similarity, problems that are particularly prominent in the computer vision field. In this research, we introduce for the first time an "Edge Attention Module (EAM)" consisting of a Max-Min pooling layer followed by convolutional layers. Max-Min pooling is an entirely novel pooling technique, specifically designed to capture only the edge information that is crucial for any object classification task. By integrating this pooling technique into the attention module, the CNN inherently prioritizes essential edge features, thereby significantly boosting the accuracy and F1-score of the model. We have implemented the proposed EAM or 2EAMs on several standard pre-trained CNN models for the Caltech-101, Caltech-256, CIFAR-100 and Tiny ImageNet-200 datasets. Extensive experiments reveal that the proposed framework (that is, EAM with CNN and 2EAMs with CNN) outperforms all pre-trained CNN models, as well as the recent models "Pooling-based Vision Transformer (PiT)", "Convolutional Block Attention Module (CBAM)", and ConvNext, by substantial margins. The proposed framework achieves accuracies of 95.5% and 86% on the Caltech-101 and Caltech-256 datasets, respectively. To the best of our knowledge, these are the best results reported on these datasets so far.
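The abstract does not spell out the Max-Min pooling formula. As a minimal sketch, assume it means the difference between max pooling and min pooling over the same window, which is large where intensities change sharply (edges) and zero in flat regions. The function name `max_min_pool2d`, the 3x3 window, the stride of 1, and the absence of padding are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def max_min_pool2d(image, window=3, stride=1):
    """Hypothetical Max-Min pooling: max(window) - min(window), no padding."""
    image = np.asarray(image, dtype=np.float64)
    if image.ndim != 2:
        raise ValueError("expected a 2-D (height, width) array")
    # View of every window x window patch:
    # shape (H - window + 1, W - window + 1, window, window)
    patches = np.lib.stride_tricks.sliding_window_view(image, (window, window))
    # Apply the stride by subsampling the patch positions
    patches = patches[::stride, ::stride]
    # Difference between the largest and smallest value in each patch
    return patches.max(axis=(-2, -1)) - patches.min(axis=(-2, -1))

# Toy image (assumed data): dark left half, bright right half
image = np.zeros((6, 6))
image[:, 3:] = 1.0

edges = max_min_pool2d(image, window=3, stride=1)
print(edges)
```

On this toy image the response is non-zero only for windows that straddle the vertical boundary between the two halves, which is the behaviour the abstract attributes to the pooling layer. In the EAM, this output would then be fed to convolutional layers; that part is not sketched here.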

Paper Structure

This paper contains 10 sections, 18 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Proposed frameworks with the Edge Attention Module (EAM): (a) EAM + pre-trained CNN; (b) 2EAM + pre-trained CNN; (c) operation of Max-Min pooling. In the lower image, the first row shows the original images and the second row shows the same images after passing through the Max-Min pooling layer.
  • Figure 3: Comparison of training curves for Inception-V3 (blue), Inception-V3 + EAM (red), and Inception-V3 + 2EAM (yellow) on the Caltech-256 dataset: (a) training accuracy vs. number of epochs; (b) training loss vs. number of epochs.
  • Figure 4: Performance comparison of several models with and without EAM on the Caltech-101 dataset. All code is shared in a GitHub repository for verification.
  • Figure 5: Comparison of Grad-CAM heatmaps. In every image, the first column shows original images from the CIFAR-100 dataset, the second column shows the Grad-CAM heatmap of Inception-V3 without EAM, and the third column shows the Grad-CAM heatmap of Inception-V3 with 2EAM.
  • Figure 6: The first column shows original images from the Caltech dataset, the second column shows the Grad-CAM heatmap of DenseNet without EAM, and the third column shows the Grad-CAM heatmap of DenseNet with EAM.