PP-PicoDet: A Better Real-Time Object Detector on Mobile Devices
Guanghua Yu, Qinyao Chang, Wenyu Lv, Chang Xu, Cheng Cui, Wei Ji, Qingqing Dang, Kaipeng Deng, Guanzhong Wang, Yuning Du, Baohua Lai, Qiwen Liu, Xiaoguang Hu, Dianhai Yu, Yanjun Ma
TL;DR
The paper tackles the accuracy-latency trade-off in mobile object detection by introducing PP-PicoDet, an anchor-free, lightweight detector. It combines an ESNet backbone searched with one-shot NAS, a CSP-PAN neck, and a four-scale detector head, and it is trained with dynamic SimOTA label assignment and a combined loss (Varifocal Loss, GIoU loss, and Distribution Focal Loss). PicoDet-S and PicoDet-L achieve state-of-the-art results among mobile detectors, with 30.6% and 40.9% mAP on COCO and only 0.99M and 3.3M parameters respectively; PicoDet-S reaches 123 FPS on a mobile ARM CPU at an input size of 320. These gains stem from architectural choices (ESNet, CSP-PAN, depthwise separable convolutions), training strategies (dynamic label assignment, EMA-based regularization), and automated backbone search. The work shows that carefully designed lightweight backbones and necks, together with efficient training objectives, can substantially narrow the gap to heavier models in real-world mobile scenarios. Code and pretrained models are publicly available for reproducibility.
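To make the box-regression term of the loss concrete, here is a minimal sketch of the standard GIoU loss for a single pair of axis-aligned boxes. It is an illustration of the general formula, not the PP-PicoDet implementation; the function name `giou_loss` and the `(x1, y1, x2, y2)` box layout are assumptions made for this example.

```python
def giou_loss(box_a, box_b):
    """Return 1 - GIoU for two boxes given as (x1, y1, x2, y2).

    Assumes x1 < x2 and y1 < y2 for both boxes (hypothetical layout
    chosen for this sketch).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle; width and height clamp to zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box that encloses both inputs.
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    enclose = enclose_w * enclose_h

    # GIoU subtracts the fraction of the enclosing box not covered by the union.
    giou = iou - (enclose - union) / enclose
    return 1.0 - giou
```

Unlike plain IoU, this loss keeps growing as disjoint boxes move farther apart, so it still distinguishes between predictions that do not overlap the target at all.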
Abstract
Achieving a better trade-off between accuracy and efficiency has been a challenging problem in object detection. In this work, we study key optimizations and neural network architecture choices for object detection to improve both accuracy and efficiency. We investigate the applicability of the anchor-free strategy to lightweight object detection models. We enhance the backbone structure and design a lightweight neck, which improves the feature extraction ability of the network. We improve the label assignment strategy and loss function to make training more stable and efficient. Through these optimizations, we create a new family of real-time object detectors, named PP-PicoDet, which achieves superior performance on object detection for mobile devices. Our models achieve better trade-offs between accuracy and latency than other popular models. PicoDet-S, with only 0.99M parameters, achieves 30.6% mAP, which is an absolute 4.8% improvement in mAP over YOLOX-Nano while reducing mobile CPU inference latency by 55%, and an absolute 7.1% improvement in mAP over NanoDet. It reaches 123 FPS (150 FPS using Paddle Lite) on a mobile ARM CPU when the input size is 320. PicoDet-L, with only 3.3M parameters, achieves 40.9% mAP, which is an absolute 3.7% improvement in mAP over YOLOv5s, and it is 44% faster. As shown in Figure 1, our models far outperform the state-of-the-art results for lightweight object detection. Code and pre-trained models are available at https://github.com/PaddlePaddle/PaddleDetection.
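The comparisons in the abstract are stated as absolute mAP gains and frame rates, so the baseline scores and per-frame times they imply can be recovered with simple arithmetic. The sketch below does only that; the helper names are hypothetical, and the derived baseline values are inferred from the deltas quoted above rather than taken from the compared papers.

```python
def implied_baseline_map(reported_map, absolute_gain):
    """Baseline mAP (in %) implied by a reported mAP and an absolute gain."""
    return round(reported_map - absolute_gain, 1)


def frame_time_ms(fps):
    """Average time per frame in milliseconds for a given frame rate."""
    return 1000.0 / fps


# Figures quoted in the abstract.
yolox_nano_map = implied_baseline_map(30.6, 4.8)  # PicoDet-S vs YOLOX-Nano
nanodet_map = implied_baseline_map(30.6, 7.1)     # PicoDet-S vs NanoDet
yolov5s_map = implied_baseline_map(40.9, 3.7)     # PicoDet-L vs YOLOv5s

print(f"Implied YOLOX-Nano mAP: {yolox_nano_map}%")
print(f"Implied NanoDet mAP:    {nanodet_map}%")
print(f"Implied YOLOv5s mAP:    {yolov5s_map}%")

# Frame rates are quoted for an input size of 320 on a mobile ARM CPU.
print(f"123 FPS is about {frame_time_ms(123):.2f} ms per frame")
print(f"150 FPS is about {frame_time_ms(150):.2f} ms per frame")
```

Note that the abstract ties the frame rates to an input size of 320 but does not state the input size for the mAP figures, so the two sets of numbers should not be assumed to describe the same configuration.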
