YOLO-Former: YOLO Shakes Hand With ViT
Javad Khoramdel, Ahmad Moori, Yasamin Borhani, Armin Ghanbarzadeh, Esmaeil Najafi
TL;DR
This paper tackles the trade-off between accuracy and speed in real-time object detection by augmenting YOLOv4 with transformer-inspired attention. It introduces YOLO-Former, featuring a Convolutional Transformer Module and a Convolutional Self-Attention Module (CSAM) integrated into the backbone, along with extensive augmentation and regularization. Empirical results on Pascal VOC (augmented with COCO data) show an mAP of up to about 86% at near real-time frame rates, with single-head CSAM providing the best speed-accuracy balance and MHMB achieving the highest accuracy. The work demonstrates that ViT-like attention can be incorporated into one-stage detectors and identifies data augmentation and regularization as crucial to unlocking the gains.
Abstract
The proposed YOLO-Former method combines ideas from the transformer and YOLOv4 to build an accurate and efficient object detection system. It retains the fast inference of YOLOv4 and gains the advantages of the transformer architecture by integrating convolutional attention and transformer modules. The approach achieves a mean average precision (mAP) of 85.76% on the Pascal VOC dataset at a prediction speed of 10.85 frames per second. The contribution of this work is to demonstrate that combining these two state-of-the-art techniques can yield further improvements in object detection.
