CNNtention: Can CNNs do better with Attention?
Nikhil Kapila, Julian Glattki, Tejas Rathi
TL;DR
This work investigates whether attention-augmented CNNs can outperform traditional CNNs on image classification while keeping computational cost manageable. The authors integrate three attention mechanisms, self-attention (SelfAtt), Multi-Head Attention (MHA), and CBAM, into a ResNet20 baseline and evaluate on CIFAR-10 and MNIST. SelfAtt and MHA provide better global context modeling and higher accuracy, while CBAM sometimes converges faster but reaches slightly lower final performance. A key design choice is to apply attention sparingly, at three placements after the feature extractors rather than after every convolution, so that performance gains are balanced against overhead. Overall, the results support the viability of attention-augmented CNNs and offer practical guidance on which attention variant to prefer under given resource constraints and deployment scenarios.
Abstract
Convolutional Neural Networks (CNNs) have long been the standard for image classification, but attention-based mechanisms have recently gained traction. This project compares traditional CNNs with attention-augmented CNNs on an image classification task. By evaluating their accuracy and computational efficiency, the project highlights the benefits and trade-offs of the localized feature extraction of traditional CNNs and the global context capture of attention-augmented CNNs. The comparison reveals further insights into their respective strengths and weaknesses, guides the selection of models based on specific application needs, and ultimately enhances understanding of these architectures in the deep learning community. This was our final project for the CS7643 Deep Learning course at Georgia Tech.
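To make the contrast between local convolution and global context capture concrete, the following is a minimal NumPy sketch of single-head self-attention over the spatial positions of one CNN feature map. It is an illustration under stated assumptions, not the authors' implementation: the function name `spatial_self_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the residual connection, and all shapes are hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_self_attention(feat, w_q, w_k, w_v):
    """Single-head self-attention over the positions of a feature map.

    feat: (C, H, W) feature map, e.g. the output of a convolutional stage.
    w_q, w_k: (C, D) query/key projections; w_v: (C, C) value projection.
    Returns the attended feature map (C, H, W) and the attention matrix (H*W, H*W).
    All names and shapes are assumptions made for this sketch.
    """
    c, h, w = feat.shape
    tokens = feat.reshape(c, h * w).T           # (N, C): one token per spatial position
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (N, N): every position scores every other
    attn = softmax(scores, axis=-1)             # each row sums to 1
    out = tokens + attn @ v                     # residual connection (an assumption)
    return out.T.reshape(c, h, w), attn


rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 8, 8))          # hypothetical 16-channel, 8x8 feature map
w_q = rng.standard_normal((16, 4)) * 0.1
w_k = rng.standard_normal((16, 4)) * 0.1
w_v = rng.standard_normal((16, 16)) * 0.1
out, attn = spatial_self_attention(feat, w_q, w_k, w_v)
```

Each row of `attn` weights all 64 spatial positions at once, whereas a 3x3 convolution mixes only 9 neighbours per output position. The (N, N) attention matrix also grows quadratically with the number of positions, which is one reason to apply attention at only a few placements.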
