End-to-End Object Detection with Fully Convolutional Network
Jianfeng Wang, Lin Song, Zeming Li, Hongbin Sun, Jian Sun, Nanning Zheng
TL;DR
This work addresses the fact that fully convolutional object detectors cannot be trained end to end because they rely on NMS. It proposes a Prediction-aware One-to-One (POTO) label assignment and a differentiable 3D Max Filtering (3DMF) module, which together remove the need for NMS. An auxiliary loss maintains supervision when one-to-one assignment reduces the number of foreground samples. The approach achieves competitive performance on COCO and substantial gains in crowded scenes such as CrowdHuman, showing that end-to-end, NMS-free detection is achievable with a fully convolutional architecture. The contributions include a principled assignment quality measure, a multi-scale local suppression mechanism, and empirical evidence of improved recall and fewer duplicate predictions without post-processing.
Abstract
Mainstream object detectors based on fully convolutional networks have achieved impressive performance. However, most of them still need a hand-designed non-maximum suppression (NMS) post-processing step, which impedes fully end-to-end training. In this paper, we analyze the effect of discarding NMS, and the results reveal that a proper label assignment plays a crucial role. To this end, for fully convolutional detectors, we introduce a Prediction-aware One-To-One (POTO) label assignment for classification to enable end-to-end detection, which obtains performance comparable to that with NMS. In addition, a simple 3D Max Filtering (3DMF) is proposed to utilize multi-scale features and improve the discriminability of convolutions in the local region. With these techniques, our end-to-end framework achieves competitive performance against many state-of-the-art detectors with NMS on the COCO and CrowdHuman datasets. The code is available at https://github.com/Megvii-BaseDetection/DeFCN.
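
To make the idea of a prediction-aware one-to-one assignment concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a quality score that blends the predicted class probability and the IoU geometrically, with a weight `alpha` whose value here is chosen only for illustration. It also swaps in a brute-force search over assignments in place of an efficient bipartite matching algorithm, so it is suitable only for tiny examples. The names `iou`, `quality`, and `one_to_one_assign` are hypothetical and do not come from the released code.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def quality(prob, overlap, alpha=0.8):
    """Assumed quality measure: geometric blend of class score and IoU."""
    return prob ** (1 - alpha) * overlap ** alpha


def one_to_one_assign(gt_boxes, gt_labels, pred_boxes, pred_probs, alpha=0.8):
    """Return, for each ground truth, the index of its single matched prediction.

    pred_probs[j][c] is the predicted probability of class c for prediction j.
    Brute force over all injective mappings; illustrative only.
    """
    n_gt, n_pred = len(gt_boxes), len(pred_boxes)
    if n_gt > n_pred:
        raise ValueError("need at least as many predictions as ground truths")
    # Quality of pairing ground truth i with prediction j.
    q = [
        [
            quality(pred_probs[j][gt_labels[i]], iou(gt_boxes[i], pred_boxes[j]), alpha)
            for j in range(n_pred)
        ]
        for i in range(n_gt)
    ]
    best, best_total = (), -1.0
    for perm in permutations(range(n_pred), n_gt):
        total = sum(q[i][perm[i]] for i in range(n_gt))
        if total > best_total:
            best, best_total = perm, total
    return list(best)


# Two objects of class 0 and 1; predictions 0 and 1 both overlap the first object.
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
gt_labels = [0, 1]
pred_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
pred_probs = [[0.9, 0.1], [0.8, 0.1], [0.1, 0.7]]
matches = one_to_one_assign(gt_boxes, gt_labels, pred_boxes, pred_probs)
```

In this toy case each ground truth receives exactly one foreground sample, and the near-duplicate prediction 1 is left as background. That is the behavior which lets a detector trained this way avoid producing duplicates that NMS would otherwise have to remove.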
