YOLO-TLA: An Efficient and Lightweight Small Object Detection Model based on YOLOv5

Chun-Lin Ji, Tao Yu, Peng Gao, Fei Wang, Ru-Yue Yuan

TL;DR

This work targets reliable small-object detection under resource constraints by extending YOLOv5 into YOLO-TLA. The key ideas are an additional detection layer in the neck for tiny objects, a lightweight C3CrossCovn module in the backbone that reduces parameters and compute, and a global attention mechanism (GAM) that reweights features at multiple points in the backbone. The authors also explore Ghost-based and CrossCovn-based lightweight modules and find that combining the tiny-object layer, the CrossCovn backbone, and GAM offers the best trade-off. Relative to the YOLOv5s baseline on MS COCO, $mAP@0.5$ improves by $4.6\%$ and $mAP@0.5:0.95$ by $4\%$, with a compact parameter footprint ($9.49$M for the YOLOv5s-based model and $27.53$M for the YOLOv5m-based model). The approach shows clear improvements over the baseline and competitive performance against state-of-the-art detectors, supporting its practicality for embedded and real-time small-object detection tasks.

Abstract

Object detection, a crucial aspect of computer vision, has seen significant advancements in accuracy and robustness. Despite these advancements, practical applications still face notable challenges, primarily the inaccurate detection or missed detection of small objects. In this paper, we propose YOLO-TLA, an advanced object detection model building on YOLOv5. We first introduce an additional detection layer for small objects in the neck network pyramid architecture, thereby producing a feature map of a larger scale to discern finer features of small objects. Further, we integrate the C3CrossCovn module into the backbone network. This module uses sliding window feature extraction, which effectively minimizes both computational demand and the number of parameters, rendering the model more compact. Additionally, we have incorporated a global attention mechanism into the backbone network. This mechanism combines the channel information with global information to create a weighted feature map. This feature map is tailored to highlight the attributes of the object of interest, while effectively ignoring irrelevant details. In comparison to the baseline YOLOv5s model, our newly developed YOLO-TLA model has shown considerable improvements on the MS COCO validation dataset, with increases of 4.6% in mAP@0.5 and 4% in mAP@0.5:0.95, all while keeping the model size compact at 9.49M parameters. Further extending these improvements to the YOLOv5m model, the enhanced version exhibited a 1.7% and 1.9% increase in mAP@0.5 and mAP@0.5:0.95, respectively, with a total of 27.53M parameters. These results validate the YOLO-TLA model's efficient and effective performance in small object detection, achieving high accuracy with fewer parameters and computational demands.
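
To make the effect of the additional detection layer concrete, the following minimal sketch computes the output grid size of each detection head. It assumes a 640x640 input and the standard YOLOv5 head strides of 8, 16, and 32. The stride of 4 for the added small-object layer is an assumption for illustration, since the abstract only says the new feature map is of a larger scale; the helper name `grid_sizes` is likewise hypothetical.

```python
def grid_sizes(input_size, strides):
    """Map each head stride to its (height, width) output grid for a square input."""
    return {s: (input_size // s, input_size // s) for s in strides}


# Standard YOLOv5 heads (strides 8, 16, 32) on a 640x640 input.
baseline = grid_sizes(640, [8, 16, 32])

# Same heads plus an assumed stride-4 head for small objects.
with_tiny = grid_sizes(640, [4, 8, 16, 32])

for stride, (h, w) in with_tiny.items():
    print(f"stride {stride:>2}: {h}x{w} grid, {h * w} cells")
```

Under these assumptions, the extra head predicts on a 160x160 grid, four times as many cells as the finest baseline grid (80x80), which is why finer features of small objects become easier to localize.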

Paper Structure

This paper contains 22 sections, 8 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Overview of the network architecture of the proposed YOLO-TLA. In the figure, $k$, $s$, and $p$ indicate the convolutional kernel size, stride, and padding size, respectively. The light green regions show our main improvements to the baseline method YOLOv5.
  • Figure 2: Overview of the C3 module. In the figure, $k$ and $s$ indicate the convolutional kernel size and stride, respectively.
  • Figure 3: The pipeline of the C3Ghost module.
  • Figure 4: Illustration of the CrossCovn module with a stride of 1 and a kernel size of 3.
  • Figure 5: The pipeline of the C3CrossCovn module.
  • ...and 6 more figures
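
As a rough illustration of why a cross-convolution design can shrink a model, the sketch below compares weight counts for a standard 3x3 convolution and a factorized pair of 1x3 and 3x1 convolutions. It assumes that CrossCovn follows the CrossConv pattern found in the YOLOv5 codebase (a 1xk convolution followed by a kx1 convolution); the paper's exact layer layout is not given in this summary. The function names, the channel width of 128, and the `expansion` parameter are hypothetical, and bias and normalization parameters are ignored.

```python
def conv_weights(c_in, c_out, kh, kw):
    """Weight count of a dense 2D convolution, without bias."""
    return c_in * c_out * kh * kw


def standard_conv_weights(c_in, c_out, k=3):
    """Weights of a single k x k convolution."""
    return conv_weights(c_in, c_out, k, k)


def cross_conv_weights(c_in, c_out, k=3, expansion=1.0):
    """Weights of an assumed cross convolution: 1 x k followed by k x 1."""
    c_mid = int(c_out * expansion)  # hidden channel width between the two convs
    return conv_weights(c_in, c_mid, 1, k) + conv_weights(c_mid, c_out, k, 1)


channels = 128  # illustrative channel width, not taken from the paper
std = standard_conv_weights(channels, channels)
cross = cross_conv_weights(channels, channels)
print(f"standard 3x3: {std} weights")
print(f"cross 1x3 + 3x1: {cross} weights ({cross / std:.2f}x)")
```

With equal input, hidden, and output widths, the factorized pair uses 2k/k^2 of the weights of the k x k convolution, which is two thirds for k = 3.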