Learning non-maximum suppression

Jan Hosang, Rodrigo Benenson, Bernt Schiele

TL;DR

The paper tackles the limitation of hand-crafted non-maximum suppression (NMS) in object detectors by introducing a learnable NMS network, Gnet, that jointly reasons over detections to output at most one high-scoring box per object. It designs a loss based on matching detections to ground truth and a gossip-style message-passing architecture that allows detections to exchange information without relying on image content. Empirical results on PETS and COCO demonstrate that Gnet improves over tuned GreedyNMS, especially in occluded scenarios and multi-class settings, suggesting the feasibility of end-to-end detectors without post-processing. The work points toward integrating NMS into learning pipelines and highlights directions for data augmentation and incorporating image features to further enhance performance.

Abstract

Object detectors have hugely profited from moving towards an end-to-end learning paradigm: proposals, features, and the classifier becoming one neural network improved results two-fold on general object detection. One indispensable component is non-maximum suppression (NMS), a post-processing algorithm responsible for merging all detections that belong to the same object. The de facto standard NMS algorithm is still fully hand-crafted, suspiciously simple, and -- being based on greedy clustering with a fixed distance threshold -- forces a trade-off between recall and precision. We propose a new network architecture designed to perform NMS, using only boxes and their score. We report experiments for person detection on PETS and for general object categories on the COCO dataset. Our approach shows promise providing improved localization and occlusion handling.
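
For reference, the hand-crafted baseline that the abstract describes (greedy clustering with a fixed distance threshold) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes boxes given as (x1, y1, x2, y2) tuples and intersection-over-union (IoU) as the overlap measure, and the names `iou`, `greedy_nms`, and `iou_threshold` are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept detections, highest score first.

    A detection is kept only if it overlaps no already-kept,
    higher-scoring detection by more than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one distant box (made-up values).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept_strict = greedy_nms(boxes, scores, iou_threshold=0.5)
kept_loose = greedy_nms(boxes, scores, iou_threshold=0.7)
```

The single threshold is where the trade-off shows up in this sketch: at 0.5 the second box is suppressed, which costs recall if it covers a distinct, occluded object, while at 0.7 it survives, which costs precision if it is a duplicate.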

Paper Structure

This paper contains 47 sections, 1 equation, 13 figures, 2 tables.

Figures (13)

  • Figure 1: We propose a non-maximum suppression convnet that will re-score all raw detections (top). Our network is trained end-to-end to learn to generate exactly one high scoring detection per object (bottom, example result).
  • Figure 2: High level diagram of the Gnet. FC denotes fully connected layers. All features in this diagram have 128 dimensions (input vector and features between the layers/blocks), the output is a scalar.
  • Figure 3: One block of our Gnet visualised for one detection. The representation of each detection is reduced and then combined into neighbouring detection pairs and concatenated with detection pair features (hatched boxes, corresponding features and detections have the same colour). Features of detection pairs are mapped independently through fully connected layers. The variable number of pairs is reduced to a fixed-size representation by max-pooling. Pairwise computations are done for each detection independently.
  • Figure 4: Performance on the PETS test set.
  • Figure 5: Performance on the PETS test set for different occlusion ranges.
  • ...and 8 more figures
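
The pairwise step described in the Figure 3 caption (build detection pairs, map each pair independently through a fully connected layer, then max-pool the variable number of pairs into a fixed-size vector) can be illustrated with a small pure-Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper text above: the neighbourhood rule (`min_iou`), the pair feature (both detection features concatenated with their IoU), the single ReLU layer, and all function names.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def linear_relu(x, weights, bias):
    """One fully connected layer followed by a ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def pool_over_pairs(i, feats, boxes, weights, bias, min_iou=0.2):
    """Fixed-size summary of detection i's neighbourhood.

    Each pair (i, j) with IoU >= min_iou (assumed rule; includes j == i)
    is mapped independently, then pairs are max-pooled element-wise.
    """
    pooled = None
    for j in range(len(boxes)):
        overlap = iou(boxes[i], boxes[j])
        if overlap < min_iou:
            continue
        pair = feats[i] + feats[j] + [overlap]  # hypothetical pair feature
        hidden = linear_relu(pair, weights, bias)
        if pooled is None:
            pooled = hidden
        else:
            pooled = [max(p, h) for p, h in zip(pooled, hidden)]
    return pooled


# Toy inputs: 2-d detection features, so each pair vector has 5 entries.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
feats = [[1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
# Toy weights that simply copy the neighbour's two features.
weights = [[0, 0, 1, 0, 0], [0, 0, 0, 1, 0]]
bias = [0.0, 0.0]
summary_0 = pool_over_pairs(0, feats, boxes, weights, bias)
summary_2 = pool_over_pairs(2, feats, boxes, weights, bias)
```

Detection 0 has two neighbours (itself and detection 1) and detection 2 has only itself, yet both summaries have the same length, which is the point of the pooling step in the caption.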