Lightweight Object Detection: A Study Based on YOLOv7 Integrated with ShuffleNetv2 and Vision Transformer

Wenkai Gong

TL;DR

To enable real-time object detection on mobile devices, the paper optimizes YOLOv7-tiny by integrating ShuffleNet v2 and Vision Transformer through two modules: the Dynamic Group Convolution Shuffle Module (DGSM) and the Dynamic Group Convolution Shuffle Transformer (DGST). DGSM reduces parameters with dynamic grouped convolution and channel shuffle, while DGST fuses Vision Transformer-style context in the neck and reduces the detection heads from three to two, yielding a lightweight, fast detector. Combining DGST and DGSM achieves the best balance of speed and accuracy (e.g., 2.02M parameters, 136.8 ms inference, $mAP_{0.5}=0.861$ and $F1=0.8483$ on a 1919-image dataset). This work demonstrates practical viability for resource-constrained deployment and outlines future refinements for broader mobile scenarios.
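The channel shuffle mentioned above can be illustrated with a minimal sketch. It follows the standard ShuffleNet-style operation (view the channels as a groups-by-channels-per-group grid, transpose, flatten) and works on a plain list of channel labels rather than real feature maps. The function name `channel_shuffle` is hypothetical, and the sketch does not attempt to reproduce the paper's DGSM, whose dynamic grouping is not specified in this summary.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups (ShuffleNet-style shuffle).

    `channels` is a list of channel labels laid out group by group.
    This is an illustrative assumption, not the paper's DGSM code.
    """
    n = len(channels)
    if groups <= 0 or n % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    # Equivalent to reshape(groups, per_group) -> transpose -> flatten.
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


if __name__ == "__main__":
    # Six channels in two groups: [0, 1, 2] and [3, 4, 5].
    print(channel_shuffle(list(range(6)), groups=2))
```

After the shuffle, each consecutive block of channels contains members of every original group, so a following grouped convolution can mix information that would otherwise stay isolated within one group.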

Abstract

As mobile computing technology rapidly evolves, deploying efficient object detection algorithms on mobile devices has become an important research area in computer vision. This study optimizes the YOLOv7 algorithm to improve its efficiency and speed on mobile platforms while maintaining high accuracy. By combining Group Convolution, ShuffleNet v2, and Vision Transformer, it reduces the model's parameter count and memory usage, simplifies the network architecture, and strengthens real-time object detection on resource-constrained devices. Experimental results show that the refined YOLO model substantially increases processing speed while maintaining high detection accuracy.
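To make the parameter saving from Group Convolution concrete, here is a small arithmetic sketch. It uses the standard weight count for a 2D convolution, k × k × (C_in / g) × C_out, where g is the number of groups. The helper name `conv_params` and the layer sizes are illustrative assumptions and are not taken from the paper.

```python
def conv_params(c_in, c_out, kernel_size, groups=1, bias=False):
    """Count the learnable parameters of a 2D convolution layer.

    With `groups` > 1, each output channel only sees c_in / groups
    input channels, so the weight count shrinks by that factor.
    """
    if c_in % groups != 0 or c_out % groups != 0:
        raise ValueError("c_in and c_out must be divisible by groups")
    weights = kernel_size * kernel_size * (c_in // groups) * c_out
    return weights + (c_out if bias else 0)


if __name__ == "__main__":
    # Hypothetical layer: 64 -> 128 channels with a 3x3 kernel.
    standard = conv_params(64, 128, 3)
    grouped = conv_params(64, 128, 3, groups=4)
    print(standard, grouped, standard // grouped)
```

For this hypothetical layer, using four groups divides the weight count by four. The trade-off is that channels in different groups no longer interact within the layer, which is the gap that channel shuffle is meant to close.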

Paper Structure (13 sections, 3 figures, 4 tables)

Introduction
Related Work
ShuffleNet v2
Vision Transformer (ViT)
You Only Look Once (YOLO)
YOLO Model Architecture
Model Overview
Dynamic Group Convolution Shuffle Module (DGSM)
Dynamic Group Convolution Shuffle Transformer (DGST)
Experiment
Setups
Analysis
Conclusion

Figures (3)

Figure 1: DGSM
Figure 2: DGST
Figure 3: Reducing the original three detection heads to two
