YOLO-Former: YOLO Shakes Hand With ViT

Javad Khoramdel, Ahmad Moori, Yasamin Borhani, Armin Ghanbarzadeh, Esmaeil Najafi

TL;DR

This paper tackles the trade-off between accuracy and speed in real-time object detection by augmenting YOLOv4 with transformer-inspired attention. It introduces YOLO-Former, which integrates a Convolutional Transformer Module and a Convolutional Self-Attention Module (CSAM) into the backbone and is trained with extensive augmentation and regularization. Empirical results on Pascal VOC (augmented with COCO data) show mAP of up to about 86% at near real-time frame rates. The single-head CSAM provides the best speed-accuracy balance, while the MHMB variant achieves the highest accuracy. The work shows that ViT-like attention can be incorporated into one-stage detectors and identifies data augmentation and regularization as crucial to realizing the gains.

Abstract

The proposed YOLO-Former method integrates ideas from the transformer and YOLOv4 to create an accurate and efficient object detection system. The method leverages the fast inference speed of YOLOv4 and incorporates the advantages of the transformer architecture through convolutional attention and transformer modules. The proposed approach reaches a mean average precision (mAP) of 85.76% on the Pascal VOC dataset while maintaining a prediction speed of 10.85 frames per second. The contribution of this work is to demonstrate how combining these two state-of-the-art techniques can lead to further improvements in object detection.

Paper Structure

This paper contains 15 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The transformer layer structure for (a) YOLOv4 and (b) YOLO-Former. Each module first pre-processes the input features and feeds them into its specialized attention module. The sum of the attention layer output and the input features is then processed differently in each model to obtain the output.
  • Figure 2: Attention modules used in (a) YOLOv4 and (b) YOLO-Former. Each input is divided into three branches: Key, Query, and Value. These branches are processed and multiplied according to the multiplication convention stated in each algorithm, then pass through one final processing stage before being output. (c) Potential building blocks used in the YOLO-Former attention module for the model iterations described in the paper: single-head (c1), multi-branch (c2), and multi-head (c3).
  • Figure 3: Two augmentations, (a) constrained rotation and (b) zoom out, applied to sample images from the dataset. Both keep all parts of the image and preserve the original size.
  • Figure 4: Comparison between YOLOv4 and YOLO-Former's average precision (AP) on Pascal VOC classes.
  • Figure 5: The mean average precision (mAP) of YOLOv4 and YOLO-Former compared to the top-performing models on the Pascal VOC test set according to Papers with Code.
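
The Key, Query, and Value branches described in the Figure 2 caption can be illustrated with a minimal single-head self-attention sketch over a convolutional feature map. This is not the paper's CSAM: the multiplication convention used there is not given in this summary, so the sketch assumes standard scaled dot-product attention, 1x1 convolutions for the three branches, and a plain residual sum. The function names and weight shapes are hypothetical and chosen only for illustration.

```python
import numpy as np


def conv1x1(x, w):
    """Apply a 1x1 convolution. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)


def single_head_self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over spatial positions (assumed form).

    x:        feature map of shape (C, H, W)
    w_q, w_k: projection weights of shape (D, C)
    w_v:      projection weights of shape (C, C), so the residual sum is valid
    Returns the output feature map (C, H, W) and the attention weights (N, N),
    where N = H * W.
    """
    c, h, w = x.shape
    n = h * w

    # Three branches: each position becomes one row (token).
    q = conv1x1(x, w_q).reshape(-1, n).T  # (N, D)
    k = conv1x1(x, w_k).reshape(-1, n).T  # (N, D)
    v = conv1x1(x, w_v).reshape(-1, n).T  # (N, C)

    # Similarity of every position with every other position.
    scores = q @ k.T / np.sqrt(q.shape[1])        # (N, N)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax

    # Weighted sum of values, reshaped back to a feature map.
    out = (weights @ v).T.reshape(c, h, w)

    # Residual connection: attention output added to the input features.
    return x + out, weights
```

Because every spatial position attends to every other position, the attention matrix has (H·W)² entries, which is one reason attention in a detector backbone trades speed for accuracy.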