End-to-End Object Detection with Fully Convolutional Network

Jianfeng Wang, Lin Song, Zeming Li, Hongbin Sun, Jian Sun, Nanning Zheng

TL;DR

This work addresses a limitation of fully convolutional object detectors: their reliance on NMS post-processing prevents fully end-to-end training. It proposes a Prediction-aware One-to-One (POTO) label assignment and a differentiable 3D Max Filtering (3DMF) module to eliminate the need for NMS. An auxiliary loss maintains supervision when one-to-one assignment reduces the number of foreground samples. The approach reports competitive performance on COCO and substantial gains in crowded scenes such as CrowdHuman, showing that end-to-end, NMS-free detection is achievable with a fully convolutional architecture. The contributions include a principled assignment quality measure, a multi-scale local suppression mechanism, and empirical evidence of improved recall and fewer duplicates without post-processing.

Abstract

Mainstream object detectors based on fully convolutional networks have achieved impressive performance. However, most of them still need a hand-designed non-maximum suppression (NMS) post-processing step, which impedes fully end-to-end training. In this paper, we analyze the effect of discarding NMS, and the results reveal that a proper label assignment plays a crucial role. To this end, for fully convolutional detectors, we introduce a Prediction-aware One-To-One (POTO) label assignment for classification to enable end-to-end detection, which obtains performance comparable to that with NMS. In addition, a simple 3D Max Filtering (3DMF) is proposed to utilize multi-scale features and improve the discriminability of convolutions in the local region. With these techniques, our end-to-end framework achieves competitive performance against many state-of-the-art detectors with NMS on the COCO and CrowdHuman datasets. The code is available at https://github.com/Megvii-BaseDetection/DeFCN.
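
To make the idea of a prediction-aware one-to-one assignment more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the assignment quality is a geometric blend of the predicted classification score and the IoU with the ground truth, weighted by a hypothetical parameter `alpha`, and it finds the best one-to-one matching by brute force. The function names, the single-class simplification, and the omission of any spatial prior are all assumptions made for illustration; the released DeFCN code should be consulted for the actual rule.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def quality(score, overlap, alpha=0.8):
    # Assumed quality form: score^(1 - alpha) * IoU^alpha.
    # alpha is a hypothetical blending weight, not taken from this page.
    return (score ** (1.0 - alpha)) * (overlap ** alpha)


def one_to_one_assign(gt_boxes, pred_boxes, pred_scores, alpha=0.8):
    """Pick exactly one positive prediction per ground-truth box.

    Single-class simplification: pred_scores[j] is the score of prediction j
    for the (only) foreground class. Brute force over permutations, so this
    is only suitable for toy inputs.
    """
    n_gt, n_pred = len(gt_boxes), len(pred_boxes)
    if n_pred < n_gt:
        raise ValueError("need at least one prediction per ground truth")
    # q[i][j]: quality of assigning prediction j to ground truth i.
    q = [[quality(pred_scores[j], iou(g, pred_boxes[j]), alpha)
          for j in range(n_pred)] for g in gt_boxes]
    # Choose the injective mapping gt -> prediction with the highest total quality.
    best = max(permutations(range(n_pred), n_gt),
               key=lambda perm: sum(q[i][j] for i, j in enumerate(perm)))
    return list(best)
```

Because the quality depends on the current predictions, the chosen positive sample can change as the scores change: a well-localized but low-scoring prediction may lose to a slightly worse-localized, high-scoring neighbor when `alpha` is small. A practical implementation would replace the brute-force search with a bipartite matching solver.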

Paper Structure

This paper contains 23 sections, 5 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: As shown in the dashed box, most detectors based on the fully convolutional network adopt multiple predictions and NMS post-processing for each instance. With the proposed prediction-aware one-to-one label assignment and 3D Max Filtering, our end-to-end detector can directly perform a single prediction for each instance without post-processing.
  • Figure 2: Diagram of the head with 3D Max Filtering (3DMF) in an FPN stage. 'POTO' indicates the proposed Prediction-aware One-to-One label assignment rule used to achieve end-to-end detection. 'Conv + $\sigma$' denotes a convolution layer followed by a sigmoid function [han1995influence], which outputs coarse classification scores. 'Aux Loss' is the proposed auxiliary loss for improving feature representation. The dotted lines highlight the additional components used in the training phase, which are discarded in the inference phase.
  • Figure 3: Diagram of 3D Max Filtering. The detailed procedure of 3D max filtering is illustrated in the dashed box. 'GN' and '$\sigma$' indicate group normalization [wu2018group] and the sigmoid activation function, respectively.
  • Figure 4: Visualization of the predicted classification scores from different approaches. The input image has three instances of different scales, i.e., person, tie and pot. For each approach, the heatmaps from left to right correspond to the score maps in the FPN stages 'P5', 'P6' and 'P7', respectively. 'Aux' indicates the proposed auxiliary loss. Compared with the vanilla FCOS framework, our POTO-based detector significantly suppresses duplicate predictions. The 3DMF enhances the distinctiveness of the local region across adjacent scales. In addition, the auxiliary loss further improves the feature representation.
  • Figure 5: The comparison graphs of performance w.r.t. training duration. The value of the horizontal axis corresponds to the training iterations. All the models are based on the ResNet-50 backbone. The threshold of NMS is set to 0.6.
  • ...and 2 more figures
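
The captions for Figures 2 to 4 describe 3D Max Filtering as operating on a local region across adjacent FPN scales. The sketch below illustrates only the max-filtering operation itself on plain score maps, not the learned head (convolution, group normalization, sigmoid) shown in the figures. It assumes a scale range parameter `tau` and a spatial kernel size `phi`, both hypothetical names, and it uses nearest-neighbor resizing as a stand-in for whatever interpolation the actual implementation uses.

```python
def resize_nearest(grid, out_h, out_w):
    """Resize a 2D list to (out_h, out_w) with nearest-neighbor sampling."""
    in_h, in_w = len(grid), len(grid[0])
    return [[grid[y * in_h // out_h][x * in_w // out_w]
             for x in range(out_w)] for y in range(out_h)]


def max_filter_3d(score_maps, s, tau=2, phi=3):
    """Max over a phi x phi spatial window and the scales adjacent to scale s.

    score_maps: list of 2D lists, one per FPN stage, ordered fine to coarse.
    tau, phi:   assumed scale range and spatial kernel size (illustrative).
    Returns a 2D list with the same shape as score_maps[s].
    """
    h, w = len(score_maps[s]), len(score_maps[s][0])
    # Scales within tau // 2 steps of s, clipped to the available stages.
    lo = max(0, s - tau // 2)
    hi = min(len(score_maps) - 1, s + tau // 2)
    # Bring every neighboring scale to the resolution of scale s.
    stack = [resize_nearest(score_maps[k], h, w) for k in range(lo, hi + 1)]
    r = phi // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Window is clipped at the borders instead of padded.
            out[y][x] = max(
                layer[yy][xx]
                for layer in stack
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
    return out
```

A location whose own score equals the filtered value is the strongest response in its spatial and scale neighborhood, which is the intuition behind using this filter to make nearby duplicate responses easier to tell apart. With `tau=0` the function reduces to an ordinary 2D max filter on a single scale.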