YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-time Object Detection

Yuming Chen; Xinbin Yuan; Jiabao Wang; Ruiqi Wu; Xiang Li; Qibin Hou; Ming-Ming Cheng

TL;DR

YOLO-MS introduces the MS-Block, Global Query Learning (GQL), and a Heterogeneous Kernel Size (HKS) selection protocol to rethink multi-scale feature learning in real-time detectors. By dynamically weighting branch contributions and varying convolution kernel sizes across stages, the approach enhances multi-scale representations while maintaining real-time speed. On MS COCO, YOLO-MS and its variants surpass recent real-time detectors at favorable parameter counts and computational cost (FLOPs), and the method generalizes to other YOLO models and to tasks such as instance segmentation and rotated object detection. The contributions offer a practical, plug-and-play strategy for improving real-time detection across diverse applications and edge devices.

Abstract

We aim to provide the object detection community with an efficient and performant object detector, termed YOLO-MS. The core design is based on a series of investigations into how the multi-branch features of the basic block and convolutions with different kernel sizes affect the detection performance on objects at different scales. The outcome is a new strategy that can significantly enhance the multi-scale feature representations of real-time object detectors. To verify the effectiveness of our work, we train YOLO-MS on the MS COCO dataset from scratch, without relying on any other large-scale dataset, such as ImageNet, or on pre-trained weights. Without bells and whistles, YOLO-MS outperforms recent state-of-the-art real-time object detectors, including YOLO-v7, RTMDet, and YOLO-v8. Taking the XS version of YOLO-MS as an example, it achieves an AP score of 42+% on MS COCO, which is about 2% higher than RTMDet with the same model size. Furthermore, our work can also serve as a plug-and-play module for other YOLO models. For example, our method advances the AP$_s$, AP$_l$, and AP of YOLOv8-N from 18%+, 52%+, and 37%+ to 20%+, 55%+, and 40%+, respectively, with even fewer parameters and MACs. Code and trained models are publicly available at https://github.com/FishAndWasabi/YOLO-MS. We also provide the Jittor version at https://github.com/NK-JittorCV/nk-yolo.

Paper Structure (17 sections, 4 equations, 8 figures, 15 tables)

Introduction
Related Work
Real-Time Object Detection
Multi-Scale Feature Representation Learning
Methodology
Rethinking Multi-Scale Feature Learning in Basic Building Blocks
Global Query Learning
Heterogeneous Kernel Size Selection Protocol
Architecture
Experiments
Experiment Setup
Analysis of GQL
Analysis of HKS Protocol
Ablation Study
Comparison with the State-of-the-Arts
...and 2 more sections

Figures (8)

Figure 1: Branch feature diversity for different YOLO models. For simplicity, only the feature visualizations of two branches are presented, which suffices to show the effectiveness of our method in enriching feature diversity. In the table, $\mathcal{D}$ is an intuitive indicator of the diversity of a detector's inter-branch features.
Figure 2: Comparisons with other state-of-the-art real-time object detectors on the MS COCO dataset [lin2014microsoft]. (a) AP performance vs. #parameters. (b) AP performance vs. #computations (MACs). The input size used to compute the MACs is $640 \times 640$. Our proposed YOLO-MS achieves the best trade-off between performance and computations.
Figure 3: (a) Architecture of fundamental blocks widely used in the previous YOLO models, e.g., the ELAN [wang2020cspnet] series or the CSP [wang2020cspnet] series. (b) Architecture of the proposed MS-Block. $n$ refers to the number of modules (we use the inverted bottleneck module [sandler2018mobilenetv2]). $Q$ denotes the global query used in the GQL.
Figure 4: Visualization of feature maps for the method with and without GQL. We showcase examples from different images for each branch to further highlight the effectiveness of our GQL in improving the localization accuracy of feature extraction. The right part depicts the method with our GQL.
Figure 5: (a) Area ratio of high feature activation value (>0.5) within the GT box for the branch corresponding to small and large objects. (b) Average Precision comparison for objects at different scales. The red color represents the method with GQL.
...and 3 more figures
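
The statistic in Figure 5(a), the area ratio of high activation values (>0.5) within the GT box, can be illustrated with a minimal sketch. This is not the authors' code: the function name `high_activation_area_ratio`, the `(x1, y1, x2, y2)` box convention with exclusive upper bounds, and the assumption that the feature map has already been normalized to [0, 1] are all hypothetical choices made for illustration.

```python
from typing import Sequence, Tuple

# Hypothetical box convention: (x1, y1, x2, y2) in feature-map cells,
# with x2 and y2 exclusive.
Box = Tuple[int, int, int, int]


def high_activation_area_ratio(
    feature_map: Sequence[Sequence[float]],
    gt_box: Box,
    threshold: float = 0.5,
) -> float:
    """Fraction of cells inside gt_box whose activation exceeds threshold.

    Assumes feature_map is a 2D grid already normalized to [0, 1].
    """
    height = len(feature_map)
    width = len(feature_map[0]) if height else 0

    # Clip the box to the feature map so out-of-range boxes are handled.
    x1, y1, x2, y2 = gt_box
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)

    area = (x2 - x1) * (y2 - y1)
    if x2 <= x1 or y2 <= y1:
        return 0.0

    # Count the cells inside the box that are strictly above the threshold.
    high = sum(
        1
        for row in feature_map[y1:y2]
        for value in row[x1:x2]
        if value > threshold
    )
    return high / area


# Toy 4x4 activation map with a GT box covering the central 2x2 region.
fmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.9, 0.7, 0.0],
    [0.2, 0.6, 0.4, 0.1],
    [0.0, 0.1, 0.0, 0.0],
]
print(high_activation_area_ratio(fmap, (1, 1, 3, 3)))  # 0.75
```

A higher ratio means more of the strong activation falls on the object itself, which is how the caption frames the comparison between branches with and without GQL.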
