Learning non-maximum suppression
Jan Hosang, Rodrigo Benenson, Bernt Schiele
TL;DR
The paper tackles the limitation of hand-crafted non-maximum suppression (NMS) in object detectors by introducing a learnable NMS network, Gnet, that jointly reasons over detections to output at most one high-scoring box per object. It designs a loss based on matching detections to ground truth and a gossip-style message-passing architecture that allows detections to exchange information without relying on image content. Empirical results on PETS and COCO demonstrate that Gnet improves over tuned GreedyNMS, especially in occluded scenarios and multi-class settings, suggesting the feasibility of end-to-end detectors without post-processing. The work points toward integrating NMS into learning pipelines and highlights directions for data augmentation and incorporating image features to further enhance performance.
Abstract
Object detectors have hugely profited from moving towards an end-to-end learning paradigm: proposals, features, and the classifier becoming one neural network improved results two-fold on general object detection. One indispensable component is non-maximum suppression (NMS), a post-processing algorithm responsible for merging all detections that belong to the same object. The de facto standard NMS algorithm is still fully hand-crafted, suspiciously simple, and -- being based on greedy clustering with a fixed distance threshold -- forces a trade-off between recall and precision. We propose a new network architecture designed to perform NMS, using only boxes and their score. We report experiments for person detection on PETS and for general object categories on the COCO dataset. Our approach shows promise providing improved localization and occlusion handling.
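
For reference, the hand-crafted baseline the abstract describes (greedy clustering with a fixed overlap threshold) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `iou` and `greedy_nms`, the corner-format boxes `(x1, y1, x2, y2)`, and the default threshold of 0.5 are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(
    boxes: Sequence[Box],
    scores: Sequence[float],
    iou_threshold: float = 0.5,  # illustrative default, not taken from the paper
) -> List[int]:
    """Return indices of kept boxes, highest score first.

    A box is kept only if it overlaps every already-kept box by at most
    `iou_threshold`; otherwise it is suppressed.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    # The first two boxes overlap with IoU = 81/119 (about 0.68).
    print(greedy_nms(boxes, scores, iou_threshold=0.5))
    print(greedy_nms(boxes, scores, iou_threshold=0.7))
```

In this toy input the two overlapping boxes are merged at a threshold of 0.5 but both survive at 0.7. That is the trade-off the abstract refers to: a low threshold can suppress a second, heavily occluded object, while a high threshold lets duplicate detections of the same object through.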
