CNNtention: Can CNNs do better with Attention?

Nikhil Kapila, Julian Glattki, Tejas Rathi

TL;DR

This work investigates whether attention-augmented CNNs can outperform traditional CNNs on image classification while keeping computational cost manageable. Three attention mechanisms, namely SelfAtt, Multi-Head Attention (MHA), and CBAM, are integrated into a ResNet20 baseline and evaluated on CIFAR-10 and MNIST. The study finds that SelfAtt and MHA model global context better and reach better accuracy, while CBAM sometimes converges faster but ends with slightly lower final performance. The authors apply attention sparingly (three placements), after feature extractors rather than after every convolution, to balance performance gains against overhead. Overall, the results support the viability of attention-augmented CNNs and offer practical guidance on which attention variant to prefer under given resource constraints and deployment scenarios.
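
To make the placement idea concrete, the sketch below applies one self-attention step to the output of a feature extractor and adds the result back through a residual connection. It is a minimal illustration, not the authors' implementation: the function names, the identity query/key/value projections (real modules learn these), the 1/sqrt(C) score scaling, and the `gamma` residual weight are all assumptions made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_residual(features, gamma=1.0):
    """Hypothetical attention step placed after a feature extractor.

    features: list of N spatial positions, each a list of C channel values
              (a feature map flattened over height and width).
    gamma:    assumed scalar weight on the attention branch; gamma=0 leaves
              the feature map unchanged.
    """
    channels = len(features[0])
    scale = math.sqrt(channels)  # assumption: scaled dot-product scores
    out = []
    for query in features:
        # Similarity of this position to every position (global context).
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in features
        ]
        weights = softmax(scores)
        # Weighted average of all positions' features.
        attended = [
            sum(w * value[j] for w, value in zip(weights, features))
            for j in range(channels)
        ]
        # Residual connection: input plus scaled attention output.
        out.append([x + gamma * a for x, a in zip(query, attended)])
    return out


if __name__ == "__main__":
    feature_map = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(self_attention_residual(feature_map, gamma=0.5))
```

Every position attends to every other position, so the cost grows quadratically with the number of spatial positions. That is one reason to place such a block only a few times, after feature extractors, rather than after every convolution.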

Abstract

Convolutional Neural Networks (CNNs) have long been the standard for image classification tasks, but more recently attention-based mechanisms have gained traction. This project aims to compare traditional CNNs with attention-augmented CNNs on an image classification task. By evaluating and comparing their performance, accuracy, and computational efficiency, the project highlights the benefits and trade-offs of the localized feature extraction of traditional CNNs and the global context capture of attention-augmented CNNs. By doing this, we can reveal further insights into their respective strengths and weaknesses, guide the selection of models based on specific application needs, and ultimately enhance understanding of these architectures in the deep learning community. This was our final project for the CS7643 Deep Learning course at Georgia Tech.

Paper Structure

This paper contains 24 sections, 4 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: CIFAR-10 and MNIST ResNet re-implementation training/testing error.
  • Figure 2: Overall architecture with attention added. The Feature Extractor layers are sequential blocks of convolutions, which can be seen in cnntention-repo. Residual connections are not shown but exist between each feature extractor and attention layer, as seen in \ref{fig:att-augmented}.
  • Figure 3: Self-attention module introduced in Self-Attention GANs (selfatt-gans).
  • Figure 4: CBAM module introduced in CBAM.
  • Figure 5: Stable training with (right) and without (left) residual connections between self-attention blocks.
  • ...and 6 more figures