ELA: Efficient Local Attention for Deep Convolutional Neural Networks

Wei Xu, Yi Wan

TL;DR

This paper introduces Efficient Local Attention (ELA), a structurally simple attention method that uses 1D convolution and Group Normalization for feature enhancement and achieves substantial performance improvements.

Abstract

The attention mechanism has gained significant recognition in the field of computer vision due to its ability to effectively enhance the performance of deep neural networks. However, existing methods often struggle to effectively utilize spatial information or, if they do, they come at the cost of reducing channel dimensions or increasing the complexity of neural networks. To address these limitations, this paper introduces an Efficient Local Attention (ELA) method that achieves substantial performance improvements with a simple structure. By analyzing the limitations of the Coordinate Attention method, we identify the lack of generalization ability in Batch Normalization, the adverse effects of dimension reduction on channel attention, and the complexity of the attention generation process. To overcome these challenges, we propose the incorporation of 1D convolution and Group Normalization feature enhancement techniques. This approach enables accurate localization of regions of interest by efficiently encoding two 1D positional feature maps without the need for dimension reduction, while allowing for a lightweight implementation. We carefully design three hyperparameters in ELA, resulting in four different versions: ELA-T, ELA-B, ELA-S, and ELA-L, to cater to the specific requirements of different visual tasks such as image classification, object detection, and semantic segmentation. ELA can be seamlessly integrated into deep CNNs such as ResNet, MobileNet, and DeepLab. Extensive evaluations on the ImageNet, MSCOCO, and Pascal VOC datasets demonstrate the superiority of the proposed ELA module over current state-of-the-art methods in all three aforementioned visual tasks.
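
As a rough illustration of the mechanism the abstract describes, the following PyTorch sketch pools the feature map into two 1D positional maps, passes each through a 1D convolution and Group Normalization, and rescales the input with the resulting sigmoid weights. It is not the authors' code: the class name `ELASketch`, the parameter names, the default values (kernel size 7, depthwise convolution, 16 normalization groups), and the choice to share one convolution across both directions are all assumptions made for this example.

```python
from typing import Optional

import torch
import torch.nn as nn


class ELASketch(nn.Module):
    """Hypothetical local-attention block: strip pooling -> Conv1d -> GroupNorm -> sigmoid."""

    def __init__(self, channels: int, kernel_size: int = 7,
                 conv_groups: Optional[int] = None, gn_groups: int = 16):
        super().__init__()
        # Assumption: depthwise 1D convolution by default (one filter per channel).
        conv_groups = channels if conv_groups is None else conv_groups
        # Input and output channel counts match, so no channel reduction occurs.
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2,
                              groups=conv_groups, bias=False)
        # Group Normalization in place of Batch Normalization.
        self.gn = nn.GroupNorm(gn_groups, channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (B, C, H, W).
        x_h = x.mean(dim=3)  # average over width  -> (B, C, H)
        x_w = x.mean(dim=2)  # average over height -> (B, C, W)
        # Assumption: the same conv and norm layers are reused for both directions.
        a_h = self.sigmoid(self.gn(self.conv(x_h))).unsqueeze(-1)  # (B, C, H, 1)
        a_w = self.sigmoid(self.gn(self.conv(x_w))).unsqueeze(-2)  # (B, C, 1, W)
        # Broadcasting applies the row and column weights at every position.
        return x * a_h * a_w


x = torch.randn(2, 64, 32, 32)
y = ELASketch(64)(x)
print(y.shape)  # torch.Size([2, 64, 32, 32])
```

Because the output has the same shape as the input, a block of this kind could be dropped after an existing convolutional stage without changing the surrounding layers. The kernel size, the number of convolution groups, and the number of normalization groups are the knobs exposed in this sketch; whether they correspond exactly to the three hyperparameters behind ELA-T, ELA-B, ELA-S, and ELA-L is not stated in the text above.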

Paper Structure

This paper contains 16 sections, 9 equations, 4 figures, and 8 tables.

Figures (4)

  • Figure 1: Performance comparison of multiple attention modules (SA-Net [zhang2021sa], ECA-Net [wang2020eca], SE [senet], CA [hou2021coordinate], CBAM [woo2018cbam], and ELA) on three computer vision tasks. The y-axis labels from left to right are top-1 accuracy, AP, and mean IoU, respectively. In the plot, “Mbv2” denotes MobileNetV2, “YX-N” denotes YOLOX-Nano, and “DLv3” denotes DeepLabV3. Our approach demonstrates superior performance not only in ImageNet classification but also in VOC object detection and VOC semantic segmentation.
  • Figure 2: Schematic diagram of Efficient Local Attention (ELA) (c), compared with the SE block [senet] (a) and Coordinate Attention (CA) [hou2021coordinate] (b). “X Avg Pool” and “Y Avg Pool” denote one-dimensional horizontal global pooling and one-dimensional vertical global pooling, respectively. Structurally, ELA is significantly more lightweight than CA and also avoids reducing the channel dimension.
  • Figure 3: Visualization examples generated by GradCAM [2017grad], using “layer4.2” as the target layer in all cases. The results show that our local attention module (ELA) localizes the objects of interest accurately.
  • Figure 4: PyTorch code for our proposed ELA-B module
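
The ELA pipeline shown schematically in Figure 2(c) can be summarized in equations. The notation below is assumed for this summary and may differ from the paper's own: $x_c$ is channel $c$ of an input with height $H$ and width $W$, $F_h$ and $F_w$ are 1D convolutions, $G_n$ is Group Normalization, and $\sigma$ is the sigmoid function.

```latex
\begin{aligned}
z_c^{h}(h) &= \frac{1}{W}\sum_{0 \le i < W} x_c(h, i), &
z_c^{w}(w) &= \frac{1}{H}\sum_{0 \le j < H} x_c(j, w), \\
y^{h} &= \sigma\big(G_n(F_h(z^{h}))\big), &
y^{w} &= \sigma\big(G_n(F_w(z^{w}))\big), \\
Y_c(h, w) &= x_c(h, w)\, y_c^{h}(h)\, y_c^{w}(w).
\end{aligned}
```

The first row is the strip pooling that produces the two 1D positional feature maps, the second row turns each map into attention weights in $(0, 1)$ without changing the number of channels, and the last row rescales every spatial position by the product of its row and column weights.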