Union-over-Intersections: Object Detection beyond Winner-Takes-All
Aritra Bhowmik, Pascal Mettes, Martin R. Oswald, Cees G. M. Snoek
TL;DR
This work rethinks how object detectors regress and merge box proposals. Instead of forcing each proposal to match the full ground-truth box, the method regresses each proposal only to its intersection with the ground truth, then forms the final box as the union of the regressed intersections in a region (UoI) rather than discarding all but the top-scoring candidate. The approach is plug-and-play across proposal-, grid-, and query-based detectors and improves localization and instance segmentation on COCO and VOC with modest overhead and minimal changes to existing pipelines. The results indicate that combining multiple proposals gives a stronger localization signal than winner-takes-all selection, that the method is robust to proposal quality, and that it is compatible with IoU-based losses.
Abstract
This paper revisits the problem of predicting box locations in object detection architectures. Typically, each box proposal or box query aims to directly maximize the intersection-over-union score with the ground truth, followed by a winner-takes-all non-maximum suppression where only the highest scoring box in each region is retained. We observe that both steps are sub-optimal: the first involves regressing proposals to the entire ground truth, which is a difficult task even with large receptive fields, and the second neglects valuable information from boxes other than the top candidate. Instead of regressing proposals to the whole ground truth, we propose a simpler approach: regress only to the area of intersection between the proposal and the ground truth. This avoids the need for proposals to extrapolate beyond their visual scope, improving localization accuracy. Rather than adopting a winner-takes-all strategy, we take the union over the regressed intersections of all boxes in a region to generate the final box outputs. Our plug-and-play method integrates seamlessly into proposal-based, grid-based, and query-based detection architectures with minimal modifications, consistently improving object localization and instance segmentation. We demonstrate its broad applicability and versatility across various detection and segmentation tasks.
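The two steps described in the abstract can be illustrated with a small geometric sketch. This is not the authors' implementation: it only shows how an intersection regression target might be computed for a proposal, and how a set of intersection boxes could be merged. The function names (`intersection_target`, `union_box`), the `(x1, y1, x2, y2)` box format, and the choice to represent the union as the smallest axis-aligned box enclosing all inputs are assumptions made for illustration. In a real detector the merged boxes would be the network's regressed outputs, not the exact targets used here.

```python
from typing import Iterable, Optional, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def intersection_target(proposal: Box, gt: Box) -> Optional[Box]:
    """Return the overlap between a proposal and the ground truth.

    This is the region a proposal would regress to instead of the full
    ground-truth box. Returns None when the boxes do not overlap.
    """
    x1 = max(proposal[0], gt[0])
    y1 = max(proposal[1], gt[1])
    x2 = min(proposal[2], gt[2])
    y2 = min(proposal[3], gt[3])
    if x2 <= x1 or y2 <= y1:
        return None
    return (x1, y1, x2, y2)


def union_box(boxes: Iterable[Box]) -> Box:
    """Merge boxes into the smallest axis-aligned box enclosing them all.

    Assumption: the union of a group is represented as its enclosing box.
    """
    boxes = list(boxes)
    if not boxes:
        raise ValueError("union_box needs at least one box")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# Hypothetical example: three proposals that each cover part of one object.
gt: Box = (20.0, 20.0, 80.0, 80.0)
proposals = [
    (10.0, 10.0, 60.0, 60.0),
    (40.0, 30.0, 90.0, 70.0),
    (30.0, 50.0, 70.0, 95.0),
]

# Each proposal's target stays inside the area it already covers.
targets = [intersection_target(p, gt) for p in proposals]
overlapping = [t for t in targets if t is not None]

# Merging the intersections recovers the full extent of the object.
merged = union_box(overlapping)
print(targets)
print(merged)
```

In this toy case no single proposal covers the whole object, yet the union of the three intersection boxes equals the ground-truth box, which is the intuition behind using all boxes in a region rather than only the top-scoring one.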
