Hands-on Evaluation of Visual Transformers for Object Recognition and Detection
Dimitrios N. Vlachogiannis, Dimitrios A. Koutsomitropoulos
TL;DR
Vision Transformers (ViTs) are evaluated across object recognition, detection, and medical imaging to address CNN limitations in global context. The study compares pure, hierarchical, and hybrid ViT architectures (e.g., ViT, Swin, CvT, YOLOS, Deformable DETR) against CNN baselines on ImageNet-1k and COCO, with ChestX-ray14 for medical imaging. Key findings show hybrid and hierarchical ViTs (Swin, CvT) achieve strong accuracy with reasonable compute; pure ViTs can excel but may be heavier; data augmentation significantly boosts small-dataset performance in medical tasks. The work highlights ViTs as competitive or superior alternatives in scenarios requiring global context and offers a reproducible experimental framework.
Abstract
Convolutional Neural Networks (CNNs) for computer vision sometimes struggle with understanding images in a global context, as they mainly focus on local patterns. On the other hand, Vision Transformers (ViTs), inspired by models originally created for language processing, use self-attention mechanisms, which allow them to understand relationships across the entire image. In this paper, we compare different types of ViTs (pure, hierarchical, and hybrid) against traditional CNN models across various tasks, including object recognition, detection, and medical image classification. We conduct thorough tests on standard datasets like ImageNet for image classification and COCO for object detection. Additionally, we apply these models to medical imaging using the ChestX-ray14 dataset. We find that hybrid and hierarchical transformers, especially Swin and CvT, offer a strong balance between accuracy and computational resources. Furthermore, by experimenting with data augmentation techniques on medical images, we discover significant performance improvements, particularly with the Swin Transformer model. Overall, our results indicate that Vision Transformers are competitive and, in many cases, outperform traditional CNNs, especially in scenarios requiring the understanding of global visual contexts like medical imaging.
