Union-over-Intersections: Object Detection beyond Winner-Takes-All

Aritra Bhowmik, Pascal Mettes, Martin R. Oswald, Cees G. M. Snoek

TL;DR

This work targets localization bottlenecks in object detection by reframing how proposals are regressed and merged. Instead of forcing each proposal to align with the full ground-truth box, the method regresses each proposal only to its intersection with the ground truth. It then merges multiple proposals by taking the union of their regressed intersections (Union-over-Intersections, UoI), rather than discarding all but the top candidate. The approach is designed to be plug-and-play across proposal-, grid-, and query-based detectors, improving localization and instance segmentation on COCO and VOC with only modest overhead and minimal changes to existing pipelines. The results indicate that cooperative use of multiple proposals yields stronger localization signals, is robust to proposal quality, and is compatible with IoU-based losses, making UoI a practical enhancement for diverse detection tasks.
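
The modified regression target described above can be illustrated with a minimal sketch. It assumes boxes in `(x1, y1, x2, y2)` corner format; the function name `intersection_target` and the choice to return `None` for disjoint boxes are illustrative assumptions, not taken from the paper.

```python
from typing import Optional, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def intersection_target(proposal: Box, gt: Box) -> Optional[Box]:
    """Return the overlap between a proposal and the ground truth.

    This overlap, rather than the full ground-truth box, would serve as
    the regression target. Returns None when the boxes do not overlap
    (hypothetical convention for this sketch).
    """
    x1 = max(proposal[0], gt[0])
    y1 = max(proposal[1], gt[1])
    x2 = min(proposal[2], gt[2])
    y2 = min(proposal[3], gt[3])
    if x2 <= x1 or y2 <= y1:
        return None
    return (x1, y1, x2, y2)


proposal = (0.0, 0.0, 6.0, 6.0)
gt = (2.0, 1.0, 10.0, 5.0)
target = intersection_target(proposal, gt)
print(target)  # (2.0, 1.0, 6.0, 5.0)
```

In this example the proposal never has to predict the right edge of the object at `x = 10`, which lies outside its own extent. It is only asked to recover the part of the object it actually covers.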

Abstract

This paper revisits the problem of predicting box locations in object detection architectures. Typically, each box proposal or box query aims to directly maximize the intersection-over-union score with the ground truth, followed by a winner-takes-all non-maximum suppression where only the highest scoring box in each region is retained. We observe that both steps are sub-optimal: the first involves regressing proposals to the entire ground truth, which is a difficult task even with large receptive fields, and the second neglects valuable information from boxes other than the top candidate. Instead of regressing proposals to the whole ground truth, we propose a simpler approach: regress only to the area of intersection between the proposal and the ground truth. This avoids the need for proposals to extrapolate beyond their visual scope, improving localization accuracy. Rather than adopting a winner-takes-all strategy, we take the union over the regressed intersections of all boxes in a region to generate the final box outputs. Our plug-and-play method integrates seamlessly into proposal-based, grid-based, and query-based detection architectures with minimal modifications, consistently improving object localization and instance segmentation. We demonstrate its broad applicability and versatility across various detection and segmentation tasks.
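
To make the contrast between the two merging strategies concrete, the sketch below compares a winner-takes-all selection with a union over regressed intersections for a single group of boxes. It assumes that the union is represented as the tightest axis-aligned box enclosing the group; that representation, the function names, and the example numbers are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


def winner_takes_all(boxes: List[Box], scores: List[float]) -> Box:
    """Keep only the highest-scoring box in the group."""
    best = max(range(len(boxes)), key=lambda i: scores[i])
    return boxes[best]


def union_over_intersections(boxes: List[Box]) -> Box:
    """Merge a group into the tightest box enclosing all members.

    Assumption of this sketch: the union of boxes is represented by
    their axis-aligned enclosing box.
    """
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# Three regressed intersections, each covering part of one object.
regressed = [(2.0, 1.0, 6.0, 5.0), (5.0, 1.0, 10.0, 4.0), (4.0, 2.0, 9.0, 5.0)]
scores = [0.9, 0.8, 0.7]

wta_box = winner_takes_all(regressed, scores)
uoi_box = union_over_intersections(regressed)
print(wta_box)  # (2.0, 1.0, 6.0, 5.0)
print(uoi_box)  # (2.0, 1.0, 10.0, 5.0)
```

Here the top-scoring box alone covers only the left part of the object, while the merged box spans the full extent covered by the group.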

Paper Structure (15 sections, 5 equations, 6 figures, 12 tables)

Figures (6)

  • Figure 1: Union-over-Intersections vs. Winner-takes-all. We introduce two simple modifications to the traditional object detection pipeline. First, in the regression stage, rather than requiring proposals to align with the entire ground truth, we adjust the targets to focus solely on their intersection with the ground truth. Second, in the post-processing stage, we perform Union-over-Intersections instead of the traditional practice of discarding less optimal bounding boxes. Our approach underscores the advantage of cooperative interaction among proposals, demonstrating that collaboration yields superior results over competitive exclusion.
  • Figure 2: Pseudo code demonstrating our minimal changes in the object detection pipeline. During regression, we adjust the target of the proposals from the entire ground truth to only the intersection with ground truth. In post-processing, we group boxes by proposal rather than regressed outcomes and merge regressed intersections, avoiding the discard of non-maximum boxes.
  • Figure 3: Overview of four ablation studies on MS-COCO with Faster R-CNN. (a) Traditional object detection struggles when the best proposal has a low initial overlap with the ground truth, whereas our Union-over-Intersections, which looks beyond winner-takes-all, is more robust to variations in proposal quality. (b) Regressing to intersections is a simpler task to optimize, as is evident from the lower loss at convergence. (c) The optimal number of top proposals per group for Union-over-Intersections is five. (d) As object detector classification performance improves, there are more correctly classified proposals that each cover parts of the ground truth. Instead of selecting a single proposal and discarding the rest, it is more effective to use all proposals to create a more comprehensive representation of the object.
  • Figure 4: Qualitative analysis showcasing the effectiveness of our method applied to Faster R-CNN for object detection. Starting from the left, our method successfully removes the incorrect 'person' prediction, precisely localizes the entire bird, and identifies the third cat as a distinct entity. However, in scenarios with clutter or a lot of overlap, as observed in the right image, our approach also consolidates multiple parrots into a single detection.
  • Figure 5: Qualitative analysis showcasing the effectiveness of our method applied to Faster R-CNN for object detection. Starting from the left, our UoI approach diminishes false positives and achieves tighter localization for both the bird and the cow, outperforming the baseline method in detecting large objects. However, in scenarios with clutter or significant overlap, our method may also aggregate objects into a single group, as seen in the rightmost image with the two cars grouped as one in the right corner.
  • ...and 1 more figure
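
The post-processing described in the Figure 2 and Figure 3(c) captions (grouping boxes by proposal, then merging the top regressed intersections per group) could look roughly like the sketch below. This is not the authors' pseudo code. The greedy grouping scheme, the `iou_thresh` value of 0.5, the enclosing-box representation of the union, and the name `uoi_postprocess` are all assumptions; only the default of five boxes per group follows the Figure 3(c) caption.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two corner-format boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def uoi_postprocess(
    proposals: List[Box],
    regressed: List[Box],
    scores: List[float],
    iou_thresh: float = 0.5,  # assumed grouping threshold
    top_k: int = 5,           # five per group, per the Figure 3(c) caption
) -> List[Tuple[Box, float]]:
    """Group by overlap between *proposals*, then merge regressed boxes."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    used = set()
    outputs = []
    for i in order:
        if i in used:
            continue
        # Grouping uses the original proposals, not the regressed outputs.
        group = [i] + [
            j for j in order
            if j != i and j not in used
            and iou(proposals[i], proposals[j]) >= iou_thresh
        ]
        used.update(group)
        top = group[:top_k]  # already sorted by descending score
        merged = (
            min(regressed[j][0] for j in top),
            min(regressed[j][1] for j in top),
            max(regressed[j][2] for j in top),
            max(regressed[j][3] for j in top),
        )
        outputs.append((merged, scores[i]))
    return outputs


proposals = [(0.0, 0.0, 6.0, 6.0), (1.0, 0.0, 7.0, 6.0), (20.0, 20.0, 30.0, 30.0)]
regressed = [(2.0, 1.0, 6.0, 5.0), (2.0, 1.0, 7.0, 5.0), (21.0, 21.0, 29.0, 29.0)]
scores = [0.9, 0.8, 0.7]
detections = uoi_postprocess(proposals, regressed, scores)
print(detections)
```

With these inputs the first two proposals overlap enough to form one group, so their regressed boxes merge into a single detection, and the third proposal forms its own group. In a multi-class detector such a routine would presumably be applied per class, as standard non-maximum suppression is.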