Fractional Correspondence Framework in Detection Transformer

Masoumeh Zareapoor, Pourya Shamsolmoali, Huiyu Zhou, Yue Lu, Salvador García

TL;DR

This work addresses the rigid one-to-one matching in DETR by introducing Regularized Transport Plan (RTP), which leverages entropy-regularized optimal transport via Sinkhorn to produce soft, fractional matches between predictions and ground truths. By relaxing marginal constraints with KL terms and entropy, RTP captures object density and distribution more faithfully than the Hungarian approach, improving convergence and detection, especially for small or densely packed objects. Empirical results on COCO and VOC show RTP-DETR outperforming several DETR variants and synergizing with IoU-aware losses, with notable gains across AP metrics and faster training. The method offers a practical, scalable alternative for end-to-end object detection and points to future extensions toward zero-shot detection using transferable transport plans.

Abstract

The Detection Transformer (DETR), by incorporating the Hungarian algorithm, has significantly simplified the matching process in object detection tasks. This algorithm facilitates optimal one-to-one matching of predicted bounding boxes to ground-truth annotations during training. While effective, this strict matching process does not inherently account for the varying densities and distributions of objects, leading to suboptimal correspondences such as failing to handle multiple detections of the same object or missing small objects. To address this, we propose the Regularized Transport Plan (RTP). RTP introduces a flexible matching strategy that captures the cost of aligning predictions with ground truths to find the most accurate correspondences between these sets. By utilizing the differentiable Sinkhorn algorithm, RTP allows for soft, fractional matching rather than strict one-to-one assignments. This approach enhances the model's capability to manage varying object densities and distributions effectively. Our extensive evaluations on the MS-COCO and VOC benchmarks demonstrate the effectiveness of our approach. RTP-DETR surpasses Deform-DETR and the recently introduced DINO-DETR, achieving absolute gains in mAP of +3.8% and +1.7%, respectively.
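
To make the idea of soft, fractional matching concrete, the following is a minimal sketch of balanced entropy-regularized optimal transport solved with Sinkhorn iterations. It is not the authors' implementation: the function name `sinkhorn_plan`, the toy cost matrix, the uniform marginals, and the values of `eps` are all assumptions chosen for illustration, and the KL relaxation of the marginals that RTP uses is omitted here.

```python
import math


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Balanced entropic OT via Sinkhorn (illustrative sketch, not the paper's RTP).

    cost[i][j] is the matching cost between prediction i and ground truth j.
    Returns a plan whose entry [i][j] is the fractional mass sent from i to j.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass per prediction (assumed uniform)
    b = [1.0 / m] * m  # mass per ground truth (assumed uniform)
    # Gibbs kernel K_ij = exp(-C_ij / eps); smaller eps gives a sharper kernel.
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns so the plan matches a and b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical costs: p1 and p2 both overlap g1, p3 overlaps g2.
cost = [
    [0.1, 0.9],
    [0.2, 0.8],
    [0.9, 0.1],
]

soft = sinkhorn_plan(cost, eps=0.1)    # smoother, more spread-out plan
sharp = sinkhorn_plan(cost, eps=0.05)  # closer to a hard assignment

for name, plan in (("eps=0.1", soft), ("eps=0.05", sharp)):
    print(name)
    for row in plan:
        print("  ", [round(x, 4) for x in row])
```

Every entry of the returned plan is strictly positive, so each prediction shares some mass with every ground truth, and lowering `eps` pushes the plan toward a hard assignment. Under these assumptions, a one-to-one matcher would instead return a 0/1 matrix with at most one nonzero entry per row and column.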

Paper Structure

This paper contains 22 sections, 7 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Learning curves (AP) for RTP-DETR and other DETR variants using a ResNet-50 backbone across two training durations: 12 epochs (short) and 50 epochs (long). Even with fewer epochs, RTP-DETR reaches a higher AP than the other models, indicating faster convergence and superior performance throughout training. DETR shows the slowest growth and lowest overall performance, converging at $\sim$42 AP by epoch 50.
  • Figure 2: Cost matrix showing assignment weights ($\Gamma_{ij}$) between predictions $\color{blue}{p_1}, \color{red!30}{p_2}, \color{red}{p_3}, \color{yellow}{p_4}, \color{green}{p_5}$ (dashed colored lines) and ground-truth objects $g_1, g_2, g_3, g_4$ (solid black lines) for different matching strategies: the Hungarian algorithm, exact optimal transport (OT, without entropy regularization), and the regularized transport plan (RTP). The Hungarian algorithm enforces strict one-to-one assignments: each ground-truth object is assigned exactly one prediction, all other predictions are ignored (0 in the matrix), and excess predictions are treated as background (bg) (see $\color{green}{p_5}\rightarrow \color{gray}{bg}$). OT without entropy regularization tends to focus heavily on high-cost matches, thus providing a hard (one-to-one) mapping; $g_2$ is matched only to $p_2$ although it also overlaps with $p_3$. RTP uses entropy regularization to smooth assignments; each ground truth connects to multiple predictions, with weights distributed more evenly across them; $g_4$ has significant matches with $p_1$ (0.75) and $p_5$ (0.76). Unlike the Hungarian algorithm, which forces all predictions to be assigned (including to the background), OT can leave predictions unassigned if they lack a sufficiently low-cost match to any ground truth.
  • Figure 3: Effect of using regularization $H(\Gamma)$. The x-axis represents the matches $\Gamma$ (transport plan), which pairs predicted objects with ground-truth objects. The y-axis is the matching cost, with lower values denoting more cost-effective pairings. The regularized transport plan (RTP) (with $\epsilon \neq 0$) shows reduced matching costs, indicating the benefits of the regularization term $H(\Gamma)$ in achieving a smooth distribution of matches. Interestingly, at $\epsilon=0$, the transport plan is sharp and this rigidity forces higher-cost assignments because the algorithm cannot "distribute" assignments to reduce cost.
  • Figure 4: Convergence curves. RTP accelerates training for different variants of DETR. The baseline models and their RTP counterparts are shown by dotted and solid lines, respectively. The horizontal axis denotes the number of epochs, and the vertical axis the AP evaluated on COCO.
  • Figure 5: Inference time vs. accuracy on COCO for various DETR-based models. The x-axis represents inference speed (FPS), where higher values indicate faster models, and the y-axis represents accuracy (AP), where higher values indicate better performance. DETR and DN-DETR achieve relatively low accuracy with similar inference speeds ($\sim$12–13 FPS). Our model, RTP-DETR, achieves the highest accuracy with competitive inference speed, outperforming DINO-DETR in precision, although it exhibits slightly lower efficiency.
  • ...and 2 more figures