Lightweight Object Detection: A Study Based on YOLOv7 Integrated with ShuffleNetv2 and Vision Transformer

Wenkai Gong

TL;DR

To enable real-time object detection on mobile devices, the paper optimizes YOLOv7-tiny by integrating ShuffleNetV2 and Vision Transformer through two modules, DGSM and DGST. DGSM reduces parameters with dynamic grouped convolution and channel shuffle, while DGST fuses Vision Transformer-style context in the neck and reduces the number of detection heads, yielding a lightweight, fast detector. Combining DGST and DGSM achieves the best balance of speed and accuracy (e.g., 2.02M parameters, 136.8 ms inference, $\text{mAP}_{0.5}=0.861$ and $F1=0.8483$ on a 1919-image dataset). This work demonstrates practical viability for resource-constrained deployment and outlines future refinements for broader mobile scenarios.

Abstract

As mobile computing technology evolves rapidly, deploying efficient object detection algorithms on mobile devices has become an important research area in computer vision. This study optimizes the YOLOv7 algorithm to improve its efficiency and speed on mobile platforms while maintaining high accuracy. By combining group convolution, ShuffleNetV2, and Vision Transformer, it reduces the model's parameter count and memory usage, simplifies the network architecture, and improves real-time object detection on resource-constrained devices. Experiments show that the refined YOLO model runs markedly faster while retaining high detection accuracy.
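
To make the two generic building blocks named above more concrete, the following is a minimal sketch of how group convolution lowers the weight count and how a ShuffleNet-style channel shuffle reorders channels so that information can mix across groups. It is not the paper's DGSM or DGST implementation: the function names `grouped_conv_params` and `channel_shuffle`, the layer sizes, and the choice to ignore bias terms are all assumptions made for illustration.

```python
def grouped_conv_params(c_in, c_out, kernel, groups):
    """Weight count of a 2D convolution with `groups` groups (bias ignored).

    Each output channel only sees c_in / groups input channels, so the
    weight count shrinks by a factor of `groups` relative to groups=1.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return kernel * kernel * (c_in // groups) * c_out


def channel_shuffle(channels, groups):
    """Reorder a flat list of channels the way ShuffleNet-style shuffle does.

    Conceptually: view the list as (groups, per_group), transpose to
    (per_group, groups), then flatten, so each run of `groups` consecutive
    outputs contains one channel taken from every input group.
    """
    n = len(channels)
    if n % groups:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


if __name__ == "__main__":
    # Hypothetical layer: 64 -> 128 channels with a 3x3 kernel.
    print(grouped_conv_params(64, 128, 3, groups=1))  # standard convolution
    print(grouped_conv_params(64, 128, 3, groups=4))  # grouped convolution
    # Six channels in two groups: [0, 1, 2] and [3, 4, 5].
    print(channel_shuffle(list(range(6)), groups=2))
```

With these illustrative sizes, four groups cut the weight count to a quarter of the standard convolution, and the shuffle interleaves the two groups so a subsequent grouped layer receives channels from both.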

Paper Structure (13 sections, 3 figures, 4 tables)